Abstract
Machine learning has achieved notable progress in malicious traffic detection, yet its effectiveness highly depends on data that are sufficiently large and reliably labeled. In practice, many datasets are produced by automated labeling pipelines, which inevitably introduce label noise and, in turn, undermine detection performance. Consequently, maintaining robust and generalizable detection under label noise has become a central challenge in network intrusion detection. Existing approaches often emphasize intrinsic model robustness. However, noise can reshape the distribution of hard examples and bias the optimization objective, which may yield unstable decision boundaries and further degrade performance. In this paper, we propose SilentSentinel, a data-centric relabeling framework comprising two components: Normal Sample Discovery (NSD) via graph propagation and Malicious Sample Screening (MSS) with dual networks. NSD proceeds in three steps: (1) confident-sample selection; (2) K-NN graph construction; and (3) label propagation. We first select high-confidence samples and assume their labels are correct, build a graph over all samples, and propagate labels from the confident subset to the full graph; samples that remain uncertain after propagation are forwarded to MSS for second-stage annotation. NSD aims to recover the majority of correctly labeled instances; these instances act as reliable anchors that guide MSS in labeling the remaining uncertain samples, thereby reducing label noise and stabilizing training. We evaluate SilentSentinel on CIC-IDS2017 and DoHBrw-2020. Under 40% label noise, it attains F1 scores of 0.81 and 0.98, respectively, yielding 17.39% and 11.36% relative improvements over state-of-the-art baselines.
Keywords: Network traffic intrusion detection, Label noise, Machine learning
Subject terms: Computational biology and bioinformatics, Engineering, Mathematics and computing
Introduction
Machine learning-based network intrusion detection methods have been widely used to detect malicious traffic in various networks. Training these models with a large amount of real malicious traffic can ensure better generalization. High-quality training data is crucial for the effectiveness of supervised machine learning.
However, collecting high-quality training data is not an easy task. A typical method involves executing malware samples captured from the real world. Specifically, independent environments are created for each type of malware in honeypot and sandbox settings1. The malware samples are executed, and all the generated traffic is collected and labeled as malicious. However, much of the traffic generated during malware execution is actually normal, leading to the potential mislabeling of normal traffic as malicious. Intrusion detection systems are also used to distinguish and label the collected traffic, but mislabeling is again common. Moreover, the frequent occurrence of zero-day attacks poses a severe challenge to the accurate labeling of real traffic2: these attacks lie beyond the knowledge scope of annotators and labeling systems, making correct labeling difficult.
Consequently, the current labeling pipeline inevitably introduces potential label noise into collected datasets, degrading the quality of the training set. Training deep neural networks (DNNs) with noisy labels is challenging and has motivated two major lines of research: robust loss design and sample selection. Robust losses (e.g., Mean Absolute Error, MAE, and Generalized Cross-Entropy, GCE) modify the loss computation to mitigate the impact of noisy labels; however, under complex high-noise conditions with heterogeneous noise patterns, their effectiveness diminishes. A second line leverages transition-matrix–based methods that estimate mislabeling probabilities to correct losses, but the reliance on accurate matrix estimation and priors limits applicability. Sample-selection strategies exploit the “easy-first” learning dynamics of DNNs, prioritizing low-loss/high-confidence samples for training. Representative approaches include Co-teaching3, Co-teaching+4, and INCV5, which use dual networks to filter “clean” samples. Nevertheless, on CIC-IDS2017 (where malicious traffic is far rarer than normal), when the label noise reaches 40% (random flips), these methods exhibit an accuracy drop of about 10 percentage points, underscoring the need to balance robustness and adaptability in noisy and imbalanced settings.
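The key property of robust losses such as MAE, compared with cross-entropy, is boundedness: a single confidently mislabeled flow cannot contribute an arbitrarily large loss (and gradient) to a mini-batch. A minimal numeric illustration, not from the paper, using a hypothetical two-class probability vector:

```python
import math

def cross_entropy(p, y):
    # CE on the labeled class; unbounded as p[y] -> 0
    return -math.log(p[y])

def mae(p, y):
    # MAE between the one-hot label and the predicted distribution; bounded by 2
    one_hot = [1.0 if c == y else 0.0 for c in range(len(p))]
    return sum(abs(o - q) for o, q in zip(one_hot, p))

# A confidently "benign" prediction carrying a (possibly wrong) "malicious" label
p = [0.99, 0.01]   # [P(benign), P(malicious)]
noisy_label = 1    # labeled malicious

print(cross_entropy(p, noisy_label))  # ~4.61: one mislabeled flow dominates the batch
print(mae(p, noisy_label))            # 1.98: bounded, so the noisy gradient is capped
```

Under heterogeneous, high-rate noise this cap alone is insufficient, which is what motivates the sample-selection line of work discussed next.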
Machine-learning–based network intrusion detection has evolved from shallow, feature-engineering–driven models to a landscape where deep learning and ensemble learning progress in parallel. Early deep-learning approaches were predominantly supervised, emphasizing end-to-end spatiotemporal representations and real-time performance: representative work includes HAST-IDS6 (hierarchical spatiotemporal learning), LuNET7 (hierarchical CNN–RNN collaboration), Pelican8 (residual connections to mitigate deep-network degradation), and CNN-BiLSTM9. These methods perform robustly on majority classes (e.g., Normal/DoS/Probe) across NSL-KDD, KDD’99, and CICIDS2017; more recent studies such as MEMCAIN10 and FA-CNN11 address class imbalance, yet they do not systematically consider noisy labels in training data under realistic conditions. As illustrated in Fig. 1, panel (A) Under Normal Circumstances depicts the decision boundary learned from clean labels, whereas (B) Under Noisy Labels shows a pronounced boundary shift that degrades detection performance.
Fig. 1.

The decision boundary changes in the presence of label noise.
To address the performance degradation of deep neural networks (DNNs) caused by label noise in complex environments, we propose SilentSentinel. The core idea is that traffic samples of the same category with similar behavior patterns exhibit high similarity in the feature space (e.g., learned embedding vectors) and cluster closely in the decision space (e.g., classifier output probabilities). To mitigate label noise, SilentSentinel selects low-loss, high-confidence samples via loss thresholding, exploiting DNNs' tendency to learn simple patterns first, and then employs pseudo-labeling in a semi-supervised framework to infer labels for the remaining samples within the same class. SilentSentinel consists of two modules:
Graph-Propagation Enhanced Normal Sample Discovery (NSD): All training data is first pre-trained to screen high-confidence samples, obtaining a small set of labeled instances. Subsequently, we construct a graph matrix for all training samples using KNN, and perform semi-supervised graph propagation with the initially labeled samples to derive label probabilities for all instances. These probabilities are then filtered to obtain reliable normal samples.
Dual-Network Malicious Sample Screening (MSS): To address the scarcity of malicious samples in practical scenarios and the issue of low-confidence samples generated during graph propagation, we implement an enhanced dual-network co-learning approach on these low-confidence samples to accurately identify purified malicious instances.
We evaluate SilentSentinel on CIC-IDS2017 and DoHBrw-2020. Under 40% label noise, it attains F1 scores of 0.81 and 0.98, respectively, yielding statistically significant relative improvements of up to 17.39% and 11.36% over state-of-the-art noise-robust baselines (p < 0.05 in most comparisons, Wilcoxon signed-rank test). Compared to a vanilla DNN (weak baseline under high noise), the gains reach 62% and 48%, respectively.
Contributions. Our contributions are as follows:
We developed a system named SilentSentinel to detect malicious traffic in network streams, addressing the challenge of label noise in the training data. The system improves the accuracy of sample labels by re-labeling, which in turn enhances the model’s performance in Network Intrusion Detection (NID).
We propose a novel module, Graph-Propagation Enhanced Normal Sample Discovery (NSD), which leverages selected high-confidence samples for label propagation within the network traffic graph. This process identifies a set of clean samples. For the remaining low-confidence samples, we apply Dual-Network Malicious Sample Screening (MSS) to perform label annotation. This approach significantly reduces the ratio of label noise in the dataset.
In our experiments, we simulated label noise scenarios based on realistic data distributions. We compared SilentSentinel with current state-of-the-art (SOTA) methods on two publicly available datasets, CIC-IDS2017 and DoHBrw-2020. The results showed significant improvements: with 40% label noise, SilentSentinel improved performance by 17.39% and 11.36% on the two datasets, respectively.
Related work
Noisy labels handling
Label noise handling aims to prevent supervised neural networks from overfitting to noisy labels. These methods fall into two categories: training robust recognition models and sample selection.
Robust machine learning models
Designing robust loss functions can reduce the impact of label noise on models. Many such functions12–15 are theoretically sound and perform well in some cases, but often fall short in practical, complex situations16. Alternatively, models17–20 using label transition matrices, which record the probability of mislabeling between classes, aim to correct loss values. However, their robustness relies heavily on accurately predicted transition matrices: if these matrices are inaccurately estimated, the corrected loss function may fail. Moreover, obtaining accurate matrices requires prior knowledge, limiting these methods’ applicability.
Sample selection
Recent studies propose selecting high-confidence samples from noisy datasets3,4,21. During training, DNNs first learn simple patterns, resulting in lower loss values for correctly labeled samples, which are considered high-confidence. Han et al.3 and Yu et al.4 train two networks with the same structure but different initial parameters, allowing them to select samples for each other; they identify low-loss samples as high-confidence. Chen et al.5 quantitatively analyze the relationship between noise ratio and test accuracy and use cross-validation to identify correct labels, enhancing model performance against noisy labels. Yuan et al.22 unify sample selection and robust training into a single framework, avoiding the information loss of individual methods and improving overall performance. However, methods such as that of Yuan et al. still underperform on noisy network traffic datasets with complex traffic categories.
Label propagation
Label Propagation is a graph-based semi-supervised learning method widely applied in network analysis, social media analysis, and cybersecurity. Initially proposed by Zhu and Ghahramani23, it leverages graph structures to propagate labels from labeled nodes to unlabeled ones based on node similarity. The algorithm assumes that adjacent nodes tend to share the same label, relying on principles of local and global consistency. Their classical approach constructs an adjacency graph and iteratively updates label probabilities, suitable for small-scale datasets.
Subsequent works enhanced Label Propagation. Zhou et al.24 introduced a regularized framework, balancing local similarity and global label distribution to improve robustness. Wang and Zhang25 proposed dynamic label propagation, allowing adaptive label updates for complex network structures. In cybersecurity, Label Propagation has been applied to anomaly detection and malicious traffic classification. For instance, Duan et al.26 utilized it for network intrusion detection by constructing traffic feature graphs to propagate labels from known malicious nodes, effectively addressing unknown attack types. Recent advancements integrate deep learning, such as Graph Neural Networks (GNNs)27, enhancing Label Propagation with multi-layer graph convolutions. Despite its effectiveness in semi-supervised settings, Label Propagation’s performance depends on graph quality and initial label accuracy.
Architecture of SilentSentinel
SilentSentinel aims to select high-quality, correctly labeled samples from a noisy training set and propagate them to construct a clean dataset, as illustrated in Fig. 2. The SilentSentinel framework consists of two key modules:
Graph-Propagation Enhanced Normal Sample Discovery (NSD): This module projects traffic features into a low-dimensional space to identify reliable samples. By leveraging feature distribution similarities among traffic data of the same class, it applies graph propagation to obtain a clean subset of labeled samples.
Dual-Network Malicious Sample Screening (MSS): Given the scarcity of malicious samples in real-world scenarios, where initial labels may be noisy, this module adapts the co-teaching algorithm for network intrusion detection, enabling more accurate refinement and selection of malicious instances.
Fig. 2.
The overview of our malicious traffic detection system SilentSentinel.
Problem statement
In real-world scenarios, collecting and labeling network traffic to build training datasets inevitably introduces label noise. Moreover, the significant imbalance between the abundant normal traffic samples and the scarce malicious traffic samples further degrades the detection performance of trained models. To address these challenges, this paper proposes a network intrusion detection framework that selects a high-quality and clean dataset from noisy training data, thereby mitigating the adverse impact of label noise and improving detection effectiveness.
We consider a $C$-class network traffic dataset with label noise. Let $(x, y)$ denote a network traffic sample, where $x$ represents the traffic features and $y \in \{0, 1, \dots, C-1\}$. Here, $y = 0$ corresponds to normal traffic, while other values indicate different types of malicious traffic. In our intrusion detection framework, the training dataset is defined as $\tilde{D} = \{(x_i, \tilde{y}_i)\}_{i=1}^{N}$, where $\tilde{y}_i$ may be an incorrect label due to noise, and $N$ denotes the number of training samples. The test dataset is given by $D_{\text{test}} = \{(x_j, y_j)\}_{j=1}^{M}$, where $M$ is the total number of test samples. Our objective is to leverage $\tilde{D}$ to accurately predict the true labels of $D_{\text{test}}$.
Graph-propagation enhanced normal sample discovery (NSD)
Graph-Propagation Enhanced Normal Sample Discovery(NSD) consists of three steps: selecting a subset of high-confidence samples, constructing a training sample graph matrix, and propagating labels through the graph to obtain labels for other samples. The goal of this method is to leverage the selected high-confidence samples and propagate their labels through the constructed graph matrix to generate high-quality labels.
Selection of confidence samples
This step primarily focuses on selecting high-confidence samples (i.e., correctly labeled samples) using cross-entropy loss values. In the next step, these high-confidence samples are used to infer the labels of other samples, improving the overall accuracy of the labeling process.
First, using traffic feature extraction tools (such as Bro, Wireshark, or other specialized software), detailed traffic features are extracted from network traffic packets (e.g., pcap files). For instance, in the CIC-IDS2017 dataset, CICFlowMeter is used for network traffic analysis to derive the relevant features.
The selection of high-confidence samples is based on the loss value, similar to previous research3, where samples with lower loss values typically have correct labels. Following this approach, we identify the samples most likely to be labeled correctly. Specifically, our training set is denoted as $\tilde{D} = \{(x_i, \tilde{y}_i)\}_{i=1}^{N}$, where $x_i$ represents the features from traffic analysis and $\tilde{y}_i$ represents the given labels.

Initially, we perform five epochs of pre-training and then use the updated model to calculate the cross-entropy loss value $\ell_i$ for each data sample,

$$\ell_i = -\log p_{\tilde{y}_i}(x_i), \tag{1}$$

where $p_c(x_i)$ denotes the predicted probability that sample $x_i$ belongs to class $c$. For both normal and malicious samples, we select the top $r\%$ of samples with the lowest loss values as high-confidence samples $D_{\text{conf}}$. The remaining samples are considered unlabeled, $D_u$, where $\tilde{D} = D_{\text{conf}} \cup D_u$.
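The per-class small-loss selection described above can be sketched in a few lines. This is a simplified illustration with made-up loss values, not the authors' implementation; the retention fraction `r` mirrors the top-$r\%$ rule in the text:

```python
from collections import defaultdict

def select_confident(losses, labels, r=0.2):
    """Per class, keep the fraction r of samples with the lowest loss."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    confident = []
    for idxs in by_class.values():
        idxs.sort(key=lambda i: losses[i])       # low loss ~ likely correct label
        n_keep = max(1, int(len(idxs) * r))
        confident.extend(idxs[:n_keep])
    kept = set(confident)
    unlabeled = [i for i in range(len(labels)) if i not in kept]
    return sorted(confident), unlabeled

# Toy losses after the 5 warm-up epochs; labels: 0 = normal, 1 = malicious
losses = [0.05, 2.30, 0.10, 1.70, 0.02, 0.90]
labels = [0, 0, 0, 1, 1, 1]
conf, unl = select_confident(losses, labels, r=0.34)
print(conf)  # [0, 4]: the lowest-loss sample in each class
```

Selecting per class (rather than globally) prevents the abundant normal class from crowding the confident set.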
KNN construction graph matrix
This step employs the KNN algorithm to construct a graph matrix from the network traffic data, which serves as the foundation for label propagation in the third step. To begin, we apply PCA to reduce the dimensionality of the data, projecting it onto its principal components. This reduction not only decreases computational complexity but also preserves the major variations in the original network traffic. Based on the reduced representation, we compute pairwise Euclidean distances between samples to identify their nearest neighbors.
Formally, given the training set $\{x_1, \dots, x_N\}$ (in the PCA-reduced space), we define the Euclidean distance between samples as

$$d(x_i, x_j) = \lVert x_i - x_j \rVert_2. \tag{2}$$

For each sample, we then select its $k$ nearest neighbors,

$$N_k(x_i) = \{\, x_j : d(x_i, x_j) \text{ is among the } k \text{ smallest distances from } x_i \,\}. \tag{3}$$

Finally, we construct an undirected graph $G = (V, E)$, where the vertex set $V$ corresponds to all samples, and the edge set $E$ contains all neighbor connections, with $(i, j) \in E$ whenever $x_j \in N_k(x_i)$ or $x_i \in N_k(x_j)$. The weight $w_{ij}$ between nodes $i$ and $j$ is defined as a decreasing function of their distance $d(x_i, x_j)$.
We empirically set $r = 20\%$ to select a sufficient number of high-confidence anchors while minimizing the risk of including noisy samples, consistent with small-loss retention ratios commonly used in the noisy-label literature (typically 10%–30%). Similarly, $d = 80\%$ is chosen as a strict confidence threshold to filter reliable pseudo-labels during propagation, preventing error amplification while retaining adequate coverage for downstream refinement, a practice aligned with confidence-based filtering in pseudo-labeling and graph-based SSL methods (commonly 70%–90%).
The raw features extracted by CICFlowMeter (or equivalent tools) include critical temporal patterns, such as Flow Duration (total active time per flow), Inter-Arrival Time statistics (mean, min, max, std of packet intervals in forward/backward directions), and Active/Idle time aggregates. These features naturally encode dynamic behaviors, e.g., burstiness in attack flows (short IATs) versus sustained normal sessions (longer, variable durations). In the KNN graph construction, temporal patterns are implicitly encoded through similarity in the feature space: flows with comparable temporal characteristics exhibit small Euclidean distances after PCA, leading to connected edges. Label propagation then leverages the graph smoothness assumption (adjacent nodes tend to share similar labels) to propagate clean anchor labels to temporally similar (potentially noisy) samples.
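The symmetric KNN construction can be sketched as follows. The 2-D points below stand in for the PCA-reduced flow features; `k` and the coordinates are illustrative only:

```python
import math

def knn_graph(points, k=2):
    """Symmetric KNN adjacency: edge (i, j) if j is among i's k nearest, or vice versa."""
    n = len(points)

    def dist(a, b):
        # Euclidean distance in the reduced feature space (Eq. 2)
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        neighbors = sorted((j for j in range(n) if j != i),
                           key=lambda j: dist(points[i], points[j]))[:k]
        for j in neighbors:
            adj[i][j] = adj[j][i] = 1   # undirected: union of both neighbor relations
    return adj

# Two tight clusters, e.g., normal flows vs. bursty attack flows
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
A = knn_graph(pts, k=2)
print(A[0][1])  # 1: nearby samples are connected
print(A[0][3])  # 0: no edge across the two clusters
```

With behaviorally similar flows linked and dissimilar ones separated, label propagation can later move anchor labels only along within-cluster edges.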
Label propagation on the graph
In the final step of NSD, we utilize the high-confidence samples and their labels obtained in the first step, together with the undirected graph constructed in the second step, to propagate labels across the dataset. During this propagation process, we apply an additional filtering mechanism to ensure high-quality samples. Specifically, only samples whose label confidence exceeds $d\%$ are assigned labels, while samples with lower confidence remain unassigned and are subjected to further refinement in the subsequent module. The three stages of our label propagation procedure are illustrated in Fig. 3. Stage 1 shows the initialization with labeled and unlabeled nodes; Stage 2 demonstrates the iterative propagation of label information; and Stage 3 highlights the confidence-based filtering, where only nodes above the threshold retain their propagated labels.
Fig. 3.
Label Propagation on the Graph.
Stage 1: Initialization and Propagation Operator Construction. We initialize a label matrix $Y^{(0)} \in \mathbb{R}^{N \times C}$, where $C$ denotes the number of classes. For each labeled node $i$ with label $\tilde{y}_i$, we assign

$$Y^{(0)}_{ic} = \begin{cases} 1, & c = \tilde{y}_i, \\ 0, & \text{otherwise}, \end{cases} \tag{4}$$

while unlabeled nodes are initialized as

$$Y^{(0)}_{ic} = 0, \quad c = 0, \dots, C-1. \tag{5}$$

To construct the propagation operator, we first add self-loops to the adjacency matrix,

$$\tilde{A} = A + I, \tag{6}$$

and compute the degree matrix

$$\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}. \tag{7}$$

The normalized propagation matrix is then given by

$$S = \tilde{D}^{-1/2} \tilde{A}\, \tilde{D}^{-1/2}. \tag{8}$$
Stage 2: Iterative Label Propagation. Label propagation proceeds iteratively according to

$$Y^{(t+1)} = S\, Y^{(t)}, \tag{9}$$

while the labels of the initially labeled nodes remain fixed:

$$Y^{(t+1)}_{i} = Y^{(0)}_{i} \quad \text{for every labeled node } i. \tag{10}$$

The iteration terminates once convergence is achieved, that is,

$$\lVert Y^{(t+1)} - Y^{(t)} \rVert_F < \varepsilon, \tag{11}$$

or when the maximum number of iterations is reached.
Stage 3: Label Assignment and Confidence Filtering. After convergence, the predicted label of each node is determined by the index of the maximum probability in its label distribution,

$$\hat{y}_i = \arg\max_{c}\, Y_{ic}. \tag{12}$$

To ensure the reliability of pseudo-labels, we further introduce a confidence-based filtering rule. The confidence of node $i$ is defined as

$$\mathrm{conf}(i) = \max_{c}\, Y_{ic}. \tag{13}$$

A label is assigned only if $\mathrm{conf}(i) \geq d\%$, while nodes with lower confidence remain unassigned and are refined in the subsequent module. The complete procedure of Graph-Propagation Enhanced Normal Sample Discovery (NSD) is summarized in Algorithm 1.
Algorithm 1.
Graph-Propagation Enhanced Normal Sample Discovery (NSD)
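The three stages above can be condensed into a short sketch. This is an illustrative implementation, not the authors' code: it assumes a dense adjacency matrix, and it reads the confidence of Eq. (13) as the largest class mass normalized by the row sum (an interpretation, since propagated rows need not sum to one):

```python
import numpy as np

def propagate(adj, seed_labels, n_classes, d=0.8, max_iter=10, tol=1e-6):
    """Propagate seed labels over a graph; returns labels (-1 = low confidence)."""
    n = len(adj)
    A = np.asarray(adj, dtype=float) + np.eye(n)           # self-loops (Eq. 6)
    Dinv = np.diag(1.0 / np.sqrt(A.sum(axis=1)))           # D^{-1/2} (Eq. 7)
    S = Dinv @ A @ Dinv                                    # propagation matrix (Eq. 8)
    Y0 = np.zeros((n, n_classes))
    for i, y in seed_labels.items():
        Y0[i, y] = 1.0                                     # one-hot seeds (Eq. 4)
    Y = Y0.copy()
    seeds = list(seed_labels)
    for _ in range(max_iter):
        Y_next = S @ Y                                     # diffusion step (Eq. 9)
        Y_next[seeds] = Y0[seeds]                          # clamp seed labels (Eq. 10)
        if np.abs(Y_next - Y).max() < tol:                 # convergence test (Eq. 11)
            Y = Y_next
            break
        Y = Y_next
    conf = Y.max(axis=1) / np.maximum(Y.sum(axis=1), 1e-12)  # normalized conf (Eq. 13)
    pred = Y.argmax(axis=1)                                  # label assignment (Eq. 12)
    return np.where(conf >= d, pred, -1)                     # -1: defer to MSS

# Path graph 0-1-2-3 with confident seeds at the two ends
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
labels = propagate(adj, {0: 0, 3: 1}, n_classes=2)
print(labels)
```

Nodes returned as `-1` correspond to the low-confidence set handed to the MSS module.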
Dual-network malicious sample screening (MSS)
This step focuses on labeling the remaining uncertain, unlabeled samples left over from NSD. We employ a co-teaching method to identify these samples, utilizing two neural networks with identical architectures but distinct parameters. Each network selects high-confidence samples while filtering out potentially noisy ones, passing the selected samples to the other network for further refinement. To address the challenges of real-world traffic scenarios, we optimize the process using the Mean Absolute Error (MAE) loss function, which mitigates the impact of data imbalance. The trained models then predict labels for the unselected samples, generating a subset of predicted samples denoted as $D_{\text{pred}}$. The final labeled dataset is obtained by combining the NSD samples with the predicted subset, yielding $D_{\text{clean}} = D_{\text{NSD}} \cup D_{\text{pred}}$.
Specifically, we design two DNN models, $f_1$ and $f_2$, which have the same structure and use the same loss function, MAE. These models are trained using the samples $D_{\text{NSD}}$ obtained from the final propagation of NSD. During training, the outputs of $f_1$ and $f_2$ are interchanged in a cross-training fashion. In each training epoch $t$, we sample a mini-batch $D_b$ from the dataset. For each mini-batch, both networks perform forward propagation and select the subset of samples that minimizes the loss. These subsets are denoted as:

$$\bar{D}_1 = \arg\min_{D' \subseteq D_b,\; |D'| = R(t)\,|D_b|} \sum_{(x, \tilde{y}) \in D'} \ell_{\mathrm{MAE}}(f_1(x), \tilde{y}), \tag{14}$$

$$\bar{D}_2 = \arg\min_{D' \subseteq D_b,\; |D'| = R(t)\,|D_b|} \sum_{(x, \tilde{y}) \in D'} \ell_{\mathrm{MAE}}(f_2(x), \tilde{y}), \tag{15}$$

where the MAE loss function is defined as:

$$\ell_{\mathrm{MAE}}(f(x), \tilde{y}) = \lVert e_{\tilde{y}} - f(x) \rVert_1, \tag{16}$$

with $e_{\tilde{y}}$ the one-hot vector of label $\tilde{y}$ and $f(x)$ the predicted class-probability vector.
In the Dual-Network Malicious Sample Screening (MSS) module, we dynamically adjust the sample selection ratio $R(t)$ over training epochs $t$ to leverage the memorization effect of deep networks (clean samples are learned earlier). Assuming the noise rate $\tau$ is known, we set:

$$R(t) = 1 - \tau \cdot \min\!\left(\frac{t}{T_k},\, 1\right), \tag{17}$$

with $T_k$ the number of warm-up epochs and $\tau$ the estimated noise level. This results in $R(t)$ starting at 1.0 and linearly decaying to $1 - \tau$ after $T_k$ epochs, then remaining constant. The linear decay ensures stable initial training while progressively excluding more noisy samples, consistent with curriculum learning in noisy-label methods (e.g., Co-teaching+4). The complete procedure of Dual-Network Malicious Sample Screening (MSS) is summarized in Algorithm 2.
Algorithm 2.

Dual-Network Malicious Sample Screening (MSS)
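The schedule of Eq. (17) and the cross-selection of Eqs. (14)–(15) reduce to a few lines. The mini-batch losses below are made-up values for illustration only:

```python
def selection_ratio(t, tau=0.4, t_k=10):
    """Eq. (17): keep everything early, then only the (1 - tau) small-loss fraction."""
    return 1.0 - tau * min(t / t_k, 1.0)

def small_loss_subset(losses, ratio):
    """Indices of the lowest-loss fraction of a mini-batch (Eqs. 14-15)."""
    n_keep = max(1, round(len(losses) * ratio))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(ranked[:n_keep])

# Cross-update: each network is trained on the subset selected by its peer
batch_losses_f1 = [0.2, 1.9, 0.1, 2.5, 0.3]   # per-sample losses under f1
batch_losses_f2 = [0.3, 2.1, 0.2, 0.4, 1.8]   # per-sample losses under f2
r = selection_ratio(t=10)                      # 0.6 once warm-up (t_k) is reached
feed_to_f2 = small_loss_subset(batch_losses_f1, r)
feed_to_f1 = small_loss_subset(batch_losses_f2, r)
print(feed_to_f2)  # [0, 2, 4]
print(feed_to_f1)  # [0, 2, 3]
```

Exchanging the selected subsets keeps the two networks' selection biases from reinforcing themselves, which is the core of the co-teaching idea.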
Evaluation
We evaluate the proposed SilentSentinel detection system on widely adopted public benchmark datasets for network intrusion and malicious traffic detection. To assess its robustness under label noise, we compare it against representative state-of-the-art methods designed for noisy-label traffic classification, including Co-teaching, Co-teaching+, INCV, and MCRe. Table 1 summarizes the key hyperparameters of SilentSentinel.
Table 1.
Key Hyperparameters of SilentSentinel.
| Parameter | Value | Description |
|---|---|---|
| $K$ (KNN graph) | 5 | Number of nearest neighbors for graph construction |
| Pre-training epochs | 5 | Epochs of initial warm-up used to compute confidence |
| Max. label-propagation iterations | 10 | Maximum iterations for label propagation |
| $r$ (confident-sample ratio) | 10%–30% | Ratio of low-loss samples selected as the confident set |
| $d$ (confidence threshold) | 70%–90% | Minimum confidence threshold for label assignment |
| $R(t)$ (MSS dynamic ratio) | $1 - \tau \cdot \min(t/T_k, 1)$ | Dynamic selection ratio; $\tau$ = estimated noise rate |
| Learning rate | 0.001 | Adam optimizer learning rate |
| Batch size | 128 | Mini-batch size during training |
| DNN architecture | 3-layer MLP (100-100-32, ReLU) | Backbone network for pre-training and MSS |
| Optimizer | Adam | Optimization algorithm |
| Max. MSS training epochs | 100 | Maximum epochs for dual-network screening |
Experiment setup
Public datasets
To validate the effectiveness of our method, we evaluated network intrusion detection performance on the CIC-IDS201728 and CIRA-CIC-DoHBrw-2020 (DoHBrw)29 datasets. The CIC-IDS2017 dataset comprises benign traffic and simulated real-world attacks categorized into 14 types, which we grouped into seven classes: DoS, Probe, DDoS, Brute Force, Web Attack, Botnet, and Infiltration. DoHBrw involves the implementation of DoH protocol across five different browsers/tools and four servers, capturing benign DoH, malicious DoH, and non-DoH traffic in applications. We cleaned the datasets by removing any NaN and infinite values.
Table 2.
Processed CICIDS-2017 dataset.
| Class | Instances |
|---|---|
| Benign | 2271320 |
| DoS Hulk | 230124 |
| PortScan | 158804 |
| DDoS | 128025 |
| DoS GoldenEye | 10293 |
| FTP-Patator | 7935 |
| SSH-Patator | 5897 |
| DoS slowloris | 5796 |
| DoS Slowhttptest | 5499 |
| Bot | 1956 |
| Web Attack:Brute Force | 1507 |
| Web Attack:XSS | 652 |
| Infiltration | 36 |
| Web Attack:Sql Injection | 21 |
| Heartbleed | 11 |
Table 3.
Processed DoHBrw dataset.
| Category | Number |
|---|---|
| Non-DoH & Benign DoH | 917300 (78.6%) |
| Malicious DoH | 249750 (21.4%) |
| Total | 1167050 |
Noise labels
In our noisy-label settings, we adopted symmetric and asymmetric noise setups similar to MCRe22. Symmetric noise arises when labels have an equal probability of being incorrectly assigned to any other class; this can result in malicious traffic being mislabeled as benign and vice versa. The accuracy of the labels depends entirely on the annotator’s labeling precision, causing varying levels of noise.
In a symmetric noise scenario, every traffic label has a certain probability of being flipped. In this study, we set the label noise ratio range to [0.2, 0.4] to test the robustness of different methods under severe noise conditions. Fig. 4 shows the corresponding label transition matrices; the horizontal axis represents ground-truth labels, and the vertical axis represents noisy labels.
Fig. 4.
Label transition matrix of asymmetric noise and symmetric noise.
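Symmetric noise injection can be sketched as a generic uniform-flip routine (an illustration of the setup described above, not the authors' exact code; the class counts are made up):

```python
import random

def flip_symmetric(labels, n_classes, noise_ratio, seed=0):
    """With probability noise_ratio, replace a label by a uniformly chosen other class."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_ratio:
            # Symmetric noise: any other class is equally likely
            noisy.append(rng.choice([c for c in range(n_classes) if c != y]))
        else:
            noisy.append(y)
    return noisy

clean = [0] * 800 + [1] * 200           # imbalanced binary labels (benign/malicious)
noisy = flip_symmetric(clean, n_classes=2, noise_ratio=0.4)
flips = sum(c != n for c, n in zip(clean, noisy))
print(flips / len(clean))  # close to the 0.4 target ratio
```

Asymmetric noise would replace the uniform choice with a class-dependent transition matrix, as depicted in Fig. 4.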
Baselines
We compared SilentSentinel to five state-of-the-art (SOTA) techniques from related work that are designed to address sample label noise and data imbalance. To ensure a fair comparison, we fine-tuned the relevant hyperparameters of each method according to the ranges recommended by their authors.
Metrics
We used two common metrics to evaluate the performance of our detection system: F1 and accuracy. We treat malicious and normal network traffic as positive and negative samples, respectively. Using the detection results and ground truth from a test set, we calculate the number of true positive samples (TP), false positive samples (FP), true negative samples (TN), and false negative samples (FN). The metrics are defined as follows:

$$\mathrm{F1} = \frac{2 \cdot \mathrm{Pre} \cdot \mathrm{Rec}}{\mathrm{Pre} + \mathrm{Rec}}, \qquad \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where $\mathrm{Pre} = \dfrac{TP}{TP + FP}$ and $\mathrm{Rec} = \dfrac{TP}{TP + FN}$.
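These definitions translate directly to code; the confusion counts below are hypothetical:

```python
def metrics(tp, fp, tn, fn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * pre * rec / (pre + rec)
    acc = (tp + tn) / (tp + fp + tn + fn)
    return pre, rec, f1, acc

# Hypothetical counts on an imbalanced test set (malicious = positive class)
pre, rec, f1, acc = metrics(tp=80, fp=20, tn=880, fn=20)
print(round(pre, 2), round(rec, 2), round(f1, 2), round(acc, 2))  # 0.8 0.8 0.8 0.96
```

Note how accuracy (0.96) flatters the detector on imbalanced traffic, which is why F1 is the primary metric in the tables below.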
Overall performance
In this section, we evaluate the performance of our proposed framework on two public datasets, DoHBrw and CIC-IDS2017, under varying levels of label noise and real-world scenarios, comparing it against established baselines. The DoHBrw dataset contains network traffic data for detecting DNS-over-HTTPS traffic, while CIC-IDS2017 includes a diverse set of network intrusion scenarios, making both suitable for assessing robustness in realistic settings. To ensure a fair comparison, we fix the total number of training samples at 10,000. Specifically, we randomly select malicious and normal samples based on a predefined class ratio derived from the dataset’s distribution. Subsequently, we introduce label noise by flipping the labels of selected samples according to a predefined label noise matrix, thereby generating the final training set. The test set consists of all remaining samples in the original dataset after removing the 10,000 training samples (approximately $N - 10000$ test samples, where $N$ is the total size of the clean dataset). No label noise is introduced to the test set; it uses the original, noise-free ground-truth labels. For a fair comparison, we follow the label noise ratios used in Co-teaching (20%, 25%, 30%, 35%, 40%). We conducted five random experiments and report the average F1 score as the performance evaluation result.
Asymmetric noise performance
Tables 4 and 5 present the F1-scores of all methods under different noise ratios with asymmetric noise settings, based on a training set of 10,000 samples. Our framework, SilentSentinel, consistently outperforms all baseline methods across all scenarios. For instance, at a noise ratio as high as 40%, SilentSentinel achieves an average F1-score of 0.98 on the DoHBrw dataset and 0.81 on the CIC-IDS2017 dataset. Compared to the vanilla DNN model, this represents an improvement of 48.48% and 62% on DoHBrw and CIC-IDS2017, respectively, and an improvement of 11.36% and 17.39% over the best baseline in related work. Most notably, variations in noise levels have minimal impact on our framework. Specifically, on the DoHBrw dataset, when the noise ratio increases from 20% to 40%, the performance of our method decreases by only 0.01, whereas the performance of other methods declines by more than 0.10. These results strongly demonstrate the significantly enhanced robustness of our system to label noise.
Table 4.
Performance (Avg ± Std) on the CIC-IDS2017 dataset under varying noise ratio (Asymmetric Noise Matrix).
| Method | Pre (20%) | Rec (20%) | F1 (20%) | Pre (25%) | Rec (25%) | F1 (25%) | Pre (30%) | Rec (30%) | F1 (30%) | Pre (35%) | Rec (35%) | F1 (35%) | Pre (40%) | Rec (40%) | F1 (40%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Vanilla DNN | .70±.02 | .75±.02 | .73±.02 | .63±.02 | .74±.05 | .68±.05 | .40±.09 | .88±.09 | .56±.06 | .40±.09 | .76±.03 | .52±.04 | .40±.07 | .66±.09 | .50±.05 |
| Co-teaching | .81±.02 | .77±.04 | .79±.03 | .83±.03 | .69±.03 | .76±.02 | .75±.04 | .69±.03 | .72±.03 | .74±.05 | .59±.08 | .66±.06 | .54±.06 | .73±.06 | .62±.02 |
| Co-teaching+ | .98±.01 | .60±.01 | .74±.02 | .96±.03 | .64±.06 | .77±.05 | .95±.04 | .58±.04 | .72±.02 | .91±.03 | .56±.12 | .69±.09 | .90±.02 | .54±.06 | .67±.04 |
| INCV | .88±.01 | .65±.06 | .75±.04 | .85±.03 | .63±.03 | .73±.02 | .83±.04 | .62±.07 | .71±.04 | .73±.04 | .59±.01 | .68±.03 | .70±.03 | .57±.08 | .65±.05 |
| MCRe | .97±.01 | .65±.03 | .77±.02 | .94±.03 | .67±.04 | .76±.03 | .89±.03 | .64±.04 | .74±.03 | .87±.03 | .60±.04 | .71±.03 | .84±.05 | .59±.05 | .69±.03 |
| SilentSentinel (ours) | .82±.04 | .91±.04 | .87±.03 | .77±.02 | .95±.03 | .85±.02 | .77±.02 | .91±.03 | .83±.02 | .73±.04 | .95±.04 | .82±.02 | .72±.05 | .91±.02 | .81±.05 |
Table 5.
Performance (Avg ± Std) on the DoHBrw dataset under varying noise ratio (Asymmetric Noise Matrix).
| Method | Pre (20%) | Rec (20%) | F1 (20%) | Pre (25%) | Rec (25%) | F1 (25%) | Pre (30%) | Rec (30%) | F1 (30%) | Pre (35%) | Rec (35%) | F1 (35%) | Pre (40%) | Rec (40%) | F1 (40%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Vanilla DNN | .91±.01 | .99±.00 | .94±.01 | .87±.02 | .97±.01 | .92±.01 | .77±.02 | .93±.01 | .85±.02 | .66±.02 | .92±.02 | .77±.03 | .53±.04 | .89±.03 | .66±.04 |
| Co-teaching | .97±.02 | .99±.00 | .98±.00 | .96±.03 | .99±.04 | .97±.01 | .95±.02 | .97±.01 | .96±.02 | .92±.03 | .96±.02 | .94±.03 | .85±.04 | .92±.03 | .88±.02 |
| Co-teaching+ | .99±.00 | .87±.08 | .93±.04 | .99±.00 | .84±.06 | .91±.03 | .89±.08 | .95±.02 | .93±.04 | .96±.00 | .82±.03 | .89±.02 | .85±.04 | .92±.03 | .88±.02 |
| INCV | .92±.01 | .88±.01 | .90±.01 | .84±.03 | .86±.02 | .85±.02 | .78±.02 | .82±.04 | .80±.02 | .76±.03 | .80±.03 | .78±.04 | .71±.03 | .80±.02 | .75±.01 |
| MCRe | .97±.01 | .96±.03 | .96±.01 | .95±.02 | .94±.03 | .94±.02 | .93±.04 | .92±.03 | .92±.02 | .90±.04 | .90±.03 | .90±.05 | .85±.05 | .88±.04 | .86±.03 |
| SilentSentinel (ours) | .97±.01 | .99±.00 | .98±.00 | .98±.01 | .99±.00 | .99±.00 | .98±.00 | .99±.00 | .99±.00 | .98±.01 | .97±.02 | .98±.01 | .98±.01 | .98±.01 | .98±.01 |
Symmetric noise performance
Tables 6 and 7 present the F1-scores of all methods under different noise ratios in the symmetric noise setting, based on a training set of 10,000 samples. Our framework, SilentSentinel, consistently outperforms all baseline methods across all scenarios.
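The symmetric setting corrupts labels by flipping each selected label uniformly to one of the other classes. A minimal sketch of this standard injection protocol (function name and seed are illustrative; the exact noise matrix used in our experiments is described in the setup section):

```python
import random

def inject_symmetric_noise(labels, noise_ratio, n_classes, seed=0):
    """Flip each label with probability noise_ratio to a uniformly chosen
    different class (symmetric noise matrix). Illustrative sketch."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_ratio:
            noisy.append(rng.choice([c for c in range(n_classes) if c != y]))
        else:
            noisy.append(y)
    return noisy

clean = [0] * 500 + [1] * 500
noisy = inject_symmetric_noise(clean, noise_ratio=0.30, n_classes=2, seed=42)
flipped = sum(a != b for a, b in zip(clean, noisy)) / len(clean)
print(f"observed noise ratio: {flipped:.3f}")  # close to 0.30
```

For binary tasks the symmetric and asymmetric matrices coincide up to the per-class flip rates, which is why the two settings are reported separately.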
Table 6.
Performance (Avg ± Std) on the CIC-IDS2017 dataset under varying noise ratio (Symmetric Noise Matrix).
| Method | Label Noise Ratio |||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 20% ||| 25% ||| 30% ||| 35% ||| 40% |||
| Metrics | Pre | Rec | F1 | Pre | Rec | F1 | Pre | Rec | F1 | Pre | Rec | F1 | Pre | Rec | F1 |
| Vanilla DNN | .61±.02 | .83±.02 | .70±.02 | .58±.02 | .81±.05 | .67±.05 | .51±.09 | .80±.09 | .62±.06 | .44±.09 | .73±.03 | .55±.04 | .35±.05 | .74±.07 | .47±.05 |
| Co-teaching | .64±.03 | .98±.01 | .78±.05 | .63±.02 | .97±.01 | .76±.03 | .62±.03 | .97±.02 | .75±.02 | .61±.04 | .98±.01 | .75±.03 | .57±.06 | .98±.01 | .72±.02 |
| Co-teaching+ | .97±.01 | .61±.01 | .75±.02 | .95±.02 | .65±.05 | .76±.04 | .94±.03 | .59±.03 | .73±.02 | .90±.02 | .57±.10 | .70±.08 | .89±.02 | .55±.05 | .68±.04 |
| INCV | .89±.01 | .64±.05 | .74±.03 | .86±.02 | .62±.03 | .72±.02 | .84±.03 | .61±.06 | .70±.03 | .74±.03 | .58±.01 | .67±.03 | .71±.02 | .56±.07 | .64±.04 |
| MCRe | .96±.01 | .66±.03 | .78±.02 | .93±.02 | .68±.04 | .77±.03 | .90±.02 | .63±.04 | .73±.03 | .88±.02 | .61±.03 | .72±.02 | .85±.04 | .60±.04 | .70±.03 |
| SilentSentinel (ours) | .83±.02 | .86±.03 | .85±.02 | .72±.03 | .79±.02 | .83±.03 | .80±.04 | .84±.04 | .82±.03 | .72±.02 | .89±.03 | .80±.02 | .70±.04 | .95±.04 | .79±.02 |
Table 7.
Performance (Avg ± Std) on the DoHBrw dataset under varying noise ratio (Symmetric Noise Matrix).
| Method | Label Noise Ratio |||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 20% ||| 25% ||| 30% ||| 35% ||| 40% |||
| Metrics | Pre | Rec | F1 | Pre | Rec | F1 | Pre | Rec | F1 | Pre | Rec | F1 | Pre | Rec | F1 |
| Vanilla DNN | .61±.01 | .82±.00 | .70±.01 | .58±.02 | .81±.01 | .68±.01 | .51±.02 | .80±.01 | .62±.02 | .44±.02 | .73±.02 | .55±.03 | .32±.04 | .65±.03 | .43±.04 |
| Co-teaching | .77±.02 | .97±.00 | .86±.00 | .80±.02 | .93±.01 | .85±.02 | .74±.03 | .94±.02 | .82±.03 | .70±.03 | .93±.04 | .80±.01 | .68±.04 | .93±.03 | .77±.02 |
| Co-teaching+ | .98±.01 | .86±.07 | .92±.04 | .98±.01 | .83±.05 | .90±.03 | .88±.07 | .92±.02 | .89±.03 | .95±.01 | .81±.03 | .88±.02 | .84±.04 | .91±.03 | .87±.02 |
| INCV | .91±.01 | .87±.01 | .89±.01 | .83±.03 | .85±.02 | .84±.02 | .77±.02 | .81±.04 | .79±.02 | .75±.03 | .79±.03 | .77±.03 | .70±.03 | .79±.02 | .74±.01 |
| MCRe | .96±.01 | .95±.03 | .95±.01 | .94±.02 | .93±.03 | .93±.02 | .92±.03 | .91±.03 | .91±.02 | .89±.04 | .89±.03 | .89±.04 | .84±.04 | .87±.04 | .85±.03 |
| SilentSentinel (ours) | .98±.01 | .99±.00 | .99±.00 | .98±.01 | .99±.00 | .99±.00 | .98±.00 | .99±.00 | .99±.00 | .98±.01 | .98±.01 | .98±.01 | .97±.01 | .97±.01 | .98±.01 |
Notably, SilentSentinel maintains a high performance level on the DoHBrw dataset under both noise settings, achieving an average F1-score of 0.98, an improvement of 127.90% over the vanilla DNN and 12.64% over the best state-of-the-art (SOTA) method. On the more diverse and complex CIC-IDS2017 dataset, SilentSentinel shows a slight performance decrease under symmetric noise but still maintains the highest F1-score, improving on the vanilla DNN by 68.08% and on the best SOTA baseline by 9.72%.
To validate the effectiveness under 30% symmetric noise, we conducted Wilcoxon signed-rank tests on both datasets, as reported in Tables 8 and 9. On the CIC-IDS2017 dataset, the proposed method significantly outperforms Vanilla DNN, Co-teaching, and Co-teaching+ (p = 0.0312 < 0.05). It also shows a strong trend over INCV and MCRe, though the difference is only marginally significant (p ≈ 0.063). On the DoHBrw dataset, our approach achieves statistical significance against all baselines (p = 0.0312 < 0.05 for each comparison). These results demonstrate that the proposed method is more stably robust under high noise levels, with the advantage particularly pronounced on the DoHBrw dataset.
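For n = 5 runs the signed-rank null distribution is small enough to enumerate exactly, which is where the p = 1/32 ≈ 0.0312 floor in the tables comes from. A self-contained sketch (the per-run F1 values below are hypothetical; in practice an off-the-shelf implementation such as `scipy.stats.wilcoxon` can be used instead):

```python
from itertools import product

def wilcoxon_exact_one_sided(diffs):
    """Exact one-sided Wilcoxon signed-rank test for small n.
    Returns (W, p): W is the rank sum of positive differences, p the
    exact one-sided p-value. Assumes no zero differences and no ties
    in |diffs|, which holds for the 5-run comparisons reported here."""
    abs_sorted = sorted(abs(d) for d in diffs)
    ranks = [abs_sorted.index(abs(d)) + 1 for d in diffs]
    w = sum(r for d, r in zip(diffs, ranks) if d > 0)
    # Enumerate all 2^n sign assignments for the exact null distribution.
    null = [sum(r for s, r in zip(signs, ranks) if s)
            for signs in product([False, True], repeat=len(diffs))]
    p = sum(nw >= w for nw in null) / len(null)
    return w, p

# Hypothetical per-run F1 scores over 5 seeds (ours vs. one baseline).
ours     = [0.83, 0.84, 0.82, 0.85, 0.81]
baseline = [0.73, 0.75, 0.74, 0.78, 0.75]
w, p = wilcoxon_exact_one_sided([a - b for a, b in zip(ours, baseline)])
print(w, p)  # W = 15, p = 0.03125 (= 1/32) when all 5 runs favor ours
```

With only 5 paired runs, p = 1/32 is the smallest attainable one-sided p-value, so "significant at α = 0.05" is the strongest possible outcome of this test.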
Table 8.
Wilcoxon signed-rank test results based on 5 runs (30% symmetric noise, CIC-IDS2017 dataset).
| Baseline Method | W | p-value | Significant (α = 0.05) | Remark |
|---|---|---|---|---|
| Vanilla DNN | 15 | 0.0312 | Yes | p<0.05 |
| Co-teaching+ | 15 | 0.0312 | Yes | p<0.05 |
| Co-teaching | 15 | 0.0312 | Yes | p<0.05 |
| INCV | 14 | 0.0625 | Marginally | p ≈ 0.063 > 0.05 |
| MCRe | 14 | 0.0625 | Marginally | p ≈ 0.063 > 0.05 |
Table 9.
Wilcoxon signed-rank test results based on 5 runs (30% symmetric noise, DoHBrw dataset).
| Baseline Method | W | p-value | Significant (α = 0.05) | Remark |
|---|---|---|---|---|
| Vanilla DNN | 15 | 0.0312 | Yes | p<0.05 |
| Co-teaching+ | 15 | 0.0312 | Yes | p<0.05 |
| Co-teaching | 15 | 0.0312 | Yes | p<0.05 |
| INCV | 14 | 0.0312 | Yes | p<0.05 |
| MCRe | 14 | 0.0312 | Yes | p<0.05 |
Evaluating individual components
Graph-propagation enhanced normal sample discovery (NSD)
We conduct two ablation studies on NSD. In the first, instead of selecting high-confidence samples, we randomly choose samples for subsequent propagation in the Selection of Confidence Samples module. In the second, we remove the Label Propagation on the Graph module and directly use the filtered high-confidence samples for further training, assigning labels to the low-confidence samples by direct prediction.
Selection of Confidence Samples. In this ablation, we randomly select 20% of the samples as high-confidence seeds for the subsequent Label Propagation on the Graph. As shown in Table 10, under a 30% label noise setting, the average F1 score decreases by 3.61% on the CIC-IDS2017 dataset and 3.03% on the DoHBrw dataset. This drop occurs because, without selecting genuinely high-confidence samples, some of the initial labels used for propagation are incorrect, which reduces the accuracy of the propagated labels. The degradation becomes more pronounced at higher noise ratios.
Table 10.
Performance comparison of SilentSentinel and its variants on the CIC-IDS2017 and DoHBrw datasets.
| Method | Metrics | CIC-IDS2017 | DoHBrw |
|---|---|---|---|
| SilentSentinel | P | .77±.04 | .98±.00 |
| | R | .91±.04 | .99±.00 |
| | F1 | .83±.02 | .99±.00 |
| without NSD-Confidence filtering | P | .75±.04 | .95±.01 |
| | R | .88±.01 | .99±.00 |
| | F1 | .80±.02 | .96±.01 |
| without NSD-Label propagation | P | .76±.01 | .95±.01 |
| | R | .70±.03 | .97±.01 |
| | F1 | .74±.03 | .96±.01 |
| without MSS | P | .70±.01 | .98±.01 |
| | R | .90±.03 | .94±.02 |
| | F1 | .79±.01 | .95±.01 |
Label Propagation on the Graph. To validate the effectiveness of propagating labels via the graph, we conduct an ablation experiment by removing the KNN graph construction and label propagation modules. Instead, the high-confidence samples selected in the previous module are directly used to predict labels for the low-confidence samples. As shown in Table 10, under 30% label noise, the average F1 score decreases by 10.84% on CIC-IDS2017 and 3.03% on DoHBrw. This decline occurs because, without label propagation, the number of reliable high-confidence samples is insufficient. Consequently, the model’s predictive capability degrades during MSS training due to the limited amount of accurately labeled data, leading to reduced final performance.
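The KNN-graph propagation step can be sketched as follows. This is a minimal illustration of the general technique (symmetric KNN graph, iterative neighbor averaging with clamped seeds); the exact propagation rule, distance metric, and confidence filtering used in NSD are described in the method section:

```python
import numpy as np

def knn_label_propagation(X, y_seed, mask, k=3, n_iter=200):
    """Sketch of NSD-style propagation: build a symmetric KNN graph,
    then iteratively average neighbor labels while clamping the
    high-confidence seed labels. X: (n, d) features; y_seed: (n,)
    binary labels, trusted only where mask is True."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d2[i])[1:k + 1]:  # k nearest, skipping self
            W[i, j] = 1.0
    W = np.maximum(W, W.T)                    # symmetrize the graph
    d_inv = 1.0 / np.maximum(W.sum(1), 1e-12)
    f = np.where(mask, y_seed.astype(float), 0.5)  # unknowns start at 0.5
    for _ in range(n_iter):
        f = d_inv * (W @ f)        # propagate: average over neighbors
        f[mask] = y_seed[mask]     # clamp the trusted seed labels
    return f

# Two well-separated clusters, one trusted seed label in each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)), rng.normal(5.0, 0.3, (10, 2))])
y_seed = np.array([0] * 10 + [1] * 10)
mask = np.zeros(20, dtype=bool)
mask[0] = mask[10] = True
f = knn_label_propagation(X, y_seed, mask, k=3)
```

Because each cluster forms a connected subgraph containing a clamped seed, the propagated scores converge toward that seed's label, which is the mechanism that lets a small confident subset annotate the rest.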
Dual-network malicious sample screening (MSS)
We perform an ablation study on the MSS module by removing it and training the model directly using the data processed by the NSD module. Without MSS, unlabeled samples are excluded from the training process. As shown in Table 10, under a 30% label noise ratio, the average F1 score decreases by 4.82% on CIC-IDS2017 and 4.04% on DoHBrw. The limited decline is attributable to the significant improvement in label accuracy after propagation. However, the residual performance drop results from the reduced number of training samples after propagation, which affects the model’s overall capability.
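The dual-network screening in MSS builds on small-loss instance selection, in which each network nominates low-loss samples for its peer to train on. A simplified sketch of that selection step (list values are illustrative; MSS adds dynamic instance selection on top of this):

```python
def coteach_select(losses_a, losses_b, keep_ratio):
    """Small-loss instance selection for a dual-network setup: each
    network keeps the fraction keep_ratio of samples on which its peer
    reports the smallest loss, then trains on that selection."""
    k = max(1, int(len(losses_a) * keep_ratio))
    idx_for_b = sorted(range(len(losses_a)), key=lambda i: losses_a[i])[:k]
    idx_for_a = sorted(range(len(losses_b)), key=lambda i: losses_b[i])[:k]
    return idx_for_a, idx_for_b

losses_a = [0.10, 2.00, 0.20, 3.00]  # per-sample losses from network A
losses_b = [0.15, 1.80, 0.30, 2.50]  # per-sample losses from network B
idx_for_a, idx_for_b = coteach_select(losses_a, losses_b, keep_ratio=0.5)
print(idx_for_a, idx_for_b)  # [0, 2] [0, 2]: high-loss samples 1 and 3 dropped
```

Exchanging selections between two differently initialized networks limits the confirmation bias a single network would accumulate when filtering its own training data.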
Removing label noise
We compare SilentSentinel with two representative baselines under varying noise conditions by measuring the residual label-noise ratio among the retained (cleaned) training samples. As shown in Fig. 5, SilentSentinel achieves the highest training-sample precision (i.e., the lowest residual noise) across all noise levels on CIC-IDS2017 and DoHBrw. At 20% noise all methods perform similarly, but their performance diverges as noise increases. Under 40% asymmetric noise, SilentSentinel reduces the residual noise to 8.75% (CIC-IDS2017) and 16.67% (DoHBrw), versus 15.93% and 32.91% for Co-teaching and 15.03% and 24.54% for Co-teaching+. Under 40% symmetric noise, SilentSentinel reduces the residual noise to 9.56% (CIC-IDS2017) and 18.91% (DoHBrw), versus 33.40% and 33.59% for Co-teaching and 24.76% and 27.56% for Co-teaching+.
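The residual-noise metric itself is simple to compute once ground-truth labels are available for evaluation. A small sketch (the label vectors are hypothetical):

```python
def residual_noise_ratio(true_labels, kept_labels):
    """Residual label-noise ratio among retained training samples,
    i.e. the fraction of kept samples whose (possibly corrected)
    label still disagrees with ground truth."""
    wrong = sum(t != y for t, y in zip(true_labels, kept_labels))
    return wrong / len(true_labels)

# Hypothetical cleaned subset of 8 samples in which 1 remains mislabeled.
ratio = residual_noise_ratio([0, 1, 1, 0, 1, 0, 0, 1],
                             [0, 1, 0, 0, 1, 0, 0, 1])
print(f"{ratio:.2%}")  # 12.50%
```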
Fig. 5.
Training Sample Accuracy of Various Methods under Different Label Noise Levels.
Since DoHBrw is designed for detecting and evaluating encrypted DNS over HTTPS (DoH) traffic, it contains fewer categories than CIC-IDS2017 and presents a lower detection difficulty. As all methods achieve high training sample accuracy on this dataset, the results are not included here. The strong performance of our method across different noise levels can be attributed to its two-stage NSD and MSS framework, which progressively improves training sample accuracy and enhances overall robustness, even under high-noise conditions.
Hyperparameter sensitivity
This section evaluates the impact of key hyperparameters on model performance across the two datasets. Specifically, we examine the selection ratio used in the Selection of Confidence Samples phase of NSD, and the confidence-filtering threshold used in its Label Propagation on the Graph module.
Under 30% noise, we tested five values of the selection ratio (10%, 15%, 20%, 25%, and 30%) and five values of the threshold (70%, 75%, 80%, 85%, and 90%). The results for the different settings are shown in Tables 11 and 12, and demonstrate that our method consistently achieves high performance across these configurations. For the CIC-IDS2017 dataset, the F1-score varies by less than 0.03 across parameter values, while for the simpler DoHBrw dataset, the F1-score remains consistently high. This consistency validates the robustness of our method in selecting high-confidence samples.
Table 11.
Performance metrics for CIC-IDS2017 and DoHBrw under different selection ratios.
| Dataset | Metrics | Top Selection Ratio |||||
|---|---|---|---|---|---|---|
| | | 10% | 15% | 20% | 25% | 30% |
| CIC-IDS2017 | P | .75±.03 | .79±.02 | .77±.02 | .80±.02 | .76±.03 |
| | R | .92±.02 | .88±.03 | .91±.03 | .90±.03 | .89±.02 |
| | F1 | .81±.02 | .84±.02 | .83±.02 | .85±.03 | .82±.02 |
| DoHBrw | P | .98±.01 | .99±.00 | .98±.00 | .99±.00 | .99±.00 |
| | R | .98±.01 | .98±.00 | .99±.00 | .97±.00 | .98±.01 |
| | F1 | .98±.01 | .99±.00 | .99±.00 | .99±.00 | .98±.01 |
Table 12.
Performance metrics for CIC-IDS2017 and DoHBrw under different threshold values.
| Dataset | Metrics | Confidence Threshold |||||
|---|---|---|---|---|---|---|
| | | 70% | 75% | 80% | 85% | 90% |
| CIC-IDS2017 | P | .74±.03 | .75±.02 | .77±.02 | .77±.04 | .76±.02 |
| | R | .88±.02 | .89±.03 | .91±.03 | .87±.04 | .90±.01 |
| | F1 | .79±.03 | .81±.02 | .83±.02 | .80±.02 | .82±.01 |
| DoHBrw | P | .98±.01 | .96±.01 | .98±.00 | .98±.00 | .96±.00 |
| | R | .97±.01 | .98±.01 | .99±.00 | .99±.01 | .95±.01 |
| | F1 | .98±.01 | .97±.01 | .99±.00 | .99±.00 | .97±.01 |
Computational cost analysis
As shown in Table 13, SilentSentinel incurs a moderate training overhead (38.52 seconds), which is approximately 1.8× and 2.3× that of vanilla DNN and Co-teaching family methods, respectively. This mainly stems from its two-stage preprocessing pipeline: NSD (KNN graph construction and label propagation) and MSS (dual-network co-teaching with dynamic instance selection).
Table 13.
Computational cost comparison. Training time includes all preprocessing stages (excluding data loading); inference time is forward-pass only (per epoch).
| Method | Training Time (s) | Inference Time (s) |
|---|---|---|
| Vanilla DNN | 21.53 | 0.453 |
| Co-teaching | 16.68 | 126 |
| Co-teaching+ | 20.23 | 0.202 |
| INCV | 181.54 | 0.135 |
| MCRe | 45.84 | 6.63 |
| SilentSentinel | 38.52 | 0.480 |
In contrast, INCV exhibits significantly higher training time (181.54 seconds), most likely due to its iterative cross-validation mechanism. Meanwhile, MCRe suffers from prohibitively high inference latency (6.63 seconds per batch), caused by online K-means clustering during inference, making it impractical for real-time NIDS deployment.
In typical Network Intrusion Detection System scenarios, model training is performed offline (e.g., periodic retraining on accumulated data), whereas inference must operate continuously on streaming traffic. Consequently, the additional training cost is generally acceptable, especially when substantial accuracy gains can be achieved in high-noise environments.
Discussion
Concept Drift. SilentSentinel is not inherently equipped to address concept drift, and its effectiveness may diminish when such drift occurs. One straightforward yet efficient countermeasure is to periodically refresh the training dataset and retrain the entire model when notable performance declines are detected. Furthermore, established concept drift detection and adaptation techniques, such as those discussed in [30] and CADE [31], can be incorporated into SilentSentinel. These methods facilitate timely detection of distributional shifts and support subsequent data relabeling and model refinement.
Extreme Label Noise. Under conditions where label noise surpasses 50%, the performance of our framework is prone to deteriorate. This is mainly because the process of selecting high-confidence samples becomes increasingly error-prone: the model is likely to assign high confidence to mislabeled examples while overlooking correctly labeled ones. It is worth noting, however, that such extreme noise levels are uncommon in real-world settings. To maintain robustness, data preprocessing steps can be applied during collection to keep the noise ratio within acceptable bounds.
Future Work. Several directions are planned for further investigation. First, we intend to explore the integration of more advanced architectures, such as Transformer-based models [32], into our framework. Second, while the current re-weighting module employs the top-performing strategy from our evaluations, alternative weighting schemes will be examined to enhance adaptability and performance. Lastly, we aim to deploy the system in operational environments for malicious traffic detection, assessing its practicality and resilience under more complex and dynamic real-world conditions.
Scalability and Large-Scale Applicability. All experiments in this study were conducted using a fixed training set of 10,000 samples, which is sufficient to demonstrate the effectiveness of the proposed method under controlled noisy-label conditions. To further assess scalability, we also evaluated training on a tenfold larger set of 100,000 samples, which took 641.98 seconds. In typical Network Intrusion Detection System scenarios, model training is performed offline, while inference runs continuously on streaming traffic. Consequently, this training duration remains practically acceptable.
To address potential scalability limitations at larger scales, we recommend the following practical approximations: (1) use approximate nearest-neighbor search via efficient libraries such as FAISS or HNSW to achieve sub-linear complexity in KNN graph construction; (2) employ sparse matrix operations for label propagation; and (3) adopt mini-batch graph sampling or incremental update strategies to handle streaming or very large datasets. Notably, the MSS module, which relies on dual-network co-teaching, is inherently scalable through standard mini-batching and GPU parallelism, akin to conventional deep neural network training.
Conclusion
We developed SilentSentinel, a system designed to enhance network traffic intrusion detection by addressing the challenge of label noise in training data. The framework consists of two key components: Graph-Propagation Enhanced Normal Sample Discovery (NSD) and Dual-Network Malicious Sample Screening (MSS). First, NSD selects high-confidence samples as label seeds, constructs a KNN graph over the entire dataset, and propagates labels to identify correctly labeled normal and malicious instances. Low-confidence samples are then processed by the MSS module for further screening, effectively reducing label noise through collaborative learning between the two modules. Extensive experiments on two public benchmarks, CIC-IDS2017 and DoHBrw-2020, demonstrate that SilentSentinel significantly outperforms state-of-the-art methods. Under 40% label noise, it achieves F1 scores of 0.81 and 0.98, corresponding to improvements of 17.39% and 11.36% over existing SOTA approaches.
Author contributions
J.D. conceived and supervised the study. R.Z. and J.D. designed the overall framework. R.Z. implemented the NSD module and conducted data preprocessing. Q.D. implemented the MSS module and prepared Figures 1-3. H.C. set up experiments, conducted evaluations on CIC-IDS2017 and DoHBrw-2020, and prepared Figures 4-5 and the tables. R.Z., Q.D., and H.C. performed the experiments and analyzed the results. J.D. and R.Z. wrote the main manuscript text. All authors reviewed and approved the final manuscript. (J.D. is the corresponding author.)
Funding
This work was supported in part by the Key Program of the Natural Science Foundation of Zhejiang Province under Grant LZ24F020007, in part by the National Natural Science Foundation of China under Grant 62072407, in part by the "Leading Goose Project Plan" of Zhejiang Province under Grants 2022C01086 and 2022C03139, in part by the National Key R&D Program of China under Grant 2022YFB2701400, and in part by the "Tianchi Talent" Distinguished Expert Program of Xinjiang Province.
Data availability
The datasets analysed in this study are publicly available from the Canadian Institute for Cybersecurity (CIC): CIC-IDS2017 (https://www.unb.ca/cic/datasets/ids-2017.html) and DoHBrw (https://www.unb.ca/cic/datasets/dohbrw-2020.html). For queries about data usage in this work or to request ancillary materials (e.g., data splits or preprocessing scripts), please contact the corresponding author.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Miramirkhani, N., Appini, M. P., Nikiforakis, N. & Polychronakis, M. Spotless sandboxes: Evading malware analysis systems using wear-and-tear artifacts. In 2017 IEEE Symposium on Security and Privacy (SP), 1009–1024 (IEEE, 2017).
- 2. Zhang, J., Li, F., Ye, F. & Wu, H. Autonomous unknown-application filtering and labeling for DL-based traffic classifier update. In IEEE INFOCOM 2020 - IEEE Conference on Computer Communications, 397–405 (IEEE, 2020).
- 3. Han, B. et al. Co-teaching: Robust training of deep neural networks with extremely noisy labels. Advances in Neural Information Processing Systems 31 (2018).
- 4. Yu, X. et al. How does disagreement help generalization against label corruption? In International Conference on Machine Learning, 7164–7173 (PMLR, 2019).
- 5. Chen, P., Liao, B. B., Chen, G. & Zhang, S. Understanding and utilizing deep neural networks trained with noisy labels. In International Conference on Machine Learning, 1062–1070 (PMLR, 2019).
- 6. Wang, W. et al. HAST-IDS: Learning hierarchical spatial-temporal features using deep neural networks to improve intrusion detection. IEEE Access 6, 1792–1806 (2017).
- 7. Wu, P. & Guo, H. LuNet: A deep neural network for network intrusion detection. In 2019 IEEE Symposium Series on Computational Intelligence (SSCI), 617–624 (IEEE, 2019).
- 8. Wu, P., Guo, H. & Moustafa, N. Pelican: A deep residual network for network intrusion detection. In 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), 55–62 (IEEE, 2020).
- 9. Rhanoui, M., Mikram, M., Yousfi, S. & Barzali, S. A CNN-BiLSTM model for document-level sentiment analysis. Mach. Learn. Knowl. Extr. 1, 832–847 (2019).
- 10. Liu, L. et al. Memcain: A memory-enhanced hybrid CNN-attention model for network anomaly detection. Sci. Rep. 15, 34958 (2025).
- 11. Attack, W. et al. Ensemble of feature augmented convolutional neural network and deep autoencoder for efficient detection of network attacks. Sci. Rep. 15, 4267 (2025).
- 12. Ghosh, A., Kumar, H. & Sastry, P. S. Robust loss functions under label noise for deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31 (2017).
- 13. Zhang, Z. & Sabuncu, M. Generalized cross entropy loss for training deep neural networks with noisy labels. Advances in Neural Information Processing Systems 31 (2018).
- 14. Lyu, Y. & Tsang, I. W. Curriculum loss: Robust learning and generalization against label corruption. arXiv preprint arXiv:1905.10045 (2019).
- 15. Ma, X. et al. Normalized loss functions for deep learning with noisy labels. In International Conference on Machine Learning, 6543–6553 (PMLR, 2020).
- 16. Song, H., Kim, M., Park, D., Shin, Y. & Lee, J.-G. Learning from noisy labels with deep neural networks: A survey. IEEE Transactions on Neural Networks and Learning Systems (2022).
- 17. Hendrycks, D., Mazeika, M., Wilson, D. & Gimpel, K. Using trusted data to train deep networks on labels corrupted by severe noise. Advances in Neural Information Processing Systems 31 (2018).
- 18. Xia, X. et al. Are anchor points really indispensable in label-noise learning? Advances in Neural Information Processing Systems 32 (2019).
- 19. Yao, Y. et al. Dual T: Reducing estimation error for transition matrix in label-noise learning. Advances in Neural Information Processing Systems 33, 7260–7271 (2020).
- 20. Wang, J., Wang, E. X. & Liu, Y. Estimating instance-dependent label-noise transition matrix using a deep neural network. In International Conference on Machine Learning (2022).
- 21. Jiang, L., Zhou, Z., Leung, T., Li, L.-J. & Fei-Fei, L. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, 2304–2313 (PMLR, 2018).
- 22. Yuan, Q. et al. MCRe: A unified framework for handling malicious traffic with noise labels based on multidimensional constraint representation. IEEE Trans. Inf. Forensics Secur. 19, 133–147 (2024).
- 23. Zhu, X. & Ghahramani, Z. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University (2002).
- 24. Zhou, D., Bousquet, O., Lal, T., Weston, J. & Schölkopf, B. Learning with local and global consistency. Advances in Neural Information Processing Systems 16 (2003).
- 25. Wang, F. & Zhang, C. Label propagation through linear neighborhoods. In Proceedings of the 23rd International Conference on Machine Learning, 985–992 (2006).
- 26. Duan, G., Lv, H., Wang, H. & Feng, G. Application of a dynamic line graph neural network for intrusion detection with semisupervised learning. IEEE Trans. Inf. Forensics Secur. 18, 699–714 (2022).
- 27. Jiang, B., Zhang, Z., Lin, D., Tang, J. & Luo, B. Semi-supervised learning with graph learning-convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11313–11320 (2019).
- 28. Sharafaldin, I. et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSP 1, 108–116 (2018).
- 29. MontazeriShatoori, M., Davidson, L., Kaur, G. & Lashkari, A. H. Detection of DoH tunnels using time-series classification of encrypted traffic. In 2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), 63–70 (IEEE, 2020).
- 30. Chen, Y., Ding, Z. & Wagner, D. Continuous learning for Android malware detection. In 32nd USENIX Security Symposium (USENIX Security 23), 1127–1144 (2023).
- 31. Yang, L. et al. CADE: Detecting and explaining concept drift samples for security applications. In 30th USENIX Security Symposium (USENIX Security 21), 2327–2344 (2021).
- 32. Vaswani, A. et al. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).