. 2026 Mar 2;16:8147. doi: 10.1038/s41598-025-34505-2

Table 19.

Ablation study on hybrid NLP pipeline components.

| Component | Description (from code and paper) | IOC extraction accuracy | Marginal gain (percentage points, vs. previous) | Key strengths/limitations | Latency impact (relative) |
|---|---|---|---|---|---|
| Regex-only | Hardcoded patterns for IPs, domains, hashes, and URLs (primary deterministic layer in code). | ~82% (strong on structured IOCs; estimated from gradient in Table 10) | Baseline | Perfect for fixed formats; fails on contextual/variant terms (e.g., malware names). | Lowest (~20 ms for 200 reports) |
| Regex + spaCy | Adds spaCy’s efficient matcher and a custom entity ruler for semi-structured cyber entities. | ~88% (+6 pp over regex-only) | +6 pp | Improves domains/malware slightly via rules; still weak on novel semantics. | Low (~35 ms) |
| Regex + spaCy + BERT (full hybrid) | BERT fine-tuned for contextual NER, with results merged in a final post-processing step in code. | 95% (reported; 95% IPs, 92% domains, 85% malware, 78% techniques) | +7 pp over Regex + spaCy | Synergistic: BERT boosts semantic IOCs; overall +13 pp vs. regex-only. | Moderate (~54 ms; 55% reduction vs. BERT-only baseline of ~120 ms) |
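To make the "Regex-only" row concrete, the following is a minimal sketch of what a deterministic regex layer for IPs, domains, hashes, and URLs could look like. The patterns, the `IOC_PATTERNS` table, and the `extract_iocs` helper are assumptions made for illustration only; the table does not show the authors' actual patterns, and this sketch is not their implementation.

```python
import re

# Hypothetical patterns for illustration; the paper's actual regexes are not given in the table.
IOC_PATTERNS = {
    # Dotted-quad IPv4 with each octet limited to 0-255.
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    # http/https URLs up to the next whitespace, quote, or angle bracket.
    "url": re.compile(r"\bhttps?://[^\s\"'<>]+", re.IGNORECASE),
    # Hex digests identified purely by length.
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    # Dot-separated labels ending in an alphabetic TLD (also matches hosts inside URLs).
    "domain": re.compile(
        r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}\b", re.IGNORECASE
    ),
}


def extract_iocs(text):
    """Return a dict mapping each IOC type to a sorted list of unique matches."""
    return {
        name: sorted(set(pattern.findall(text)))
        for name, pattern in IOC_PATTERNS.items()
    }


if __name__ == "__main__":
    report = (
        "Beacon to 192.168.1.10 and http://evil.example.com/payload observed; "
        "hash d41d8cd98f00b204e9800998ecf8427e"
    )
    for ioc_type, values in extract_iocs(report).items():
        print(ioc_type, values)
```

A sentence such as "Emotet dropper seen" yields no matches under these patterns, which mirrors the limitation noted in the table: fixed-format indicators are easy to capture, while contextual terms such as malware names need the rule-based and BERT layers.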