Table 19.
Ablation study on hybrid NLP pipeline components.
| Component | Description (from Code & Paper) | IOC Extraction Accuracy | Marginal Gain (vs. Previous) | Key Strengths/Limitations | Latency Impact (relative) |
|---|---|---|---|---|---|
| Regex-only | Hardcoded patterns for IPs, domains, hashes, and URLs (primary deterministic layer in code). | ~82% (strong on structured IOCs; estimated from gradient in Table 10) | Baseline | Perfect for fixed formats; fails on contextual/variant terms (e.g., malware names). | Lowest (~20 ms for 200 reports) |
| Regex + spaCy | Adds spaCy’s efficient matcher and a custom entity ruler for semi-structured cyber entities. | ~88% (+6 pp over regex-only) | +6 pp | Slightly improves domain and malware extraction via rules; still weak on novel semantics. | Low (~35 ms) |
| Regex + spaCy + BERT (Full Hybrid) | Adds BERT fine-tuned for contextual NER and merges the results of all layers (final post-processing in code). | 95% (reported; 95% IPs, 92% domains, 85% malware, 78% techniques) | +7 pp over Regex + spaCy | Synergistic: BERT boosts semantic IOCs; overall +13 pp vs. regex-only. | Moderate (~54 ms; 55% reduction vs. BERT-only baseline of ~120 ms) |
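
The layering in the table can be illustrated with a minimal sketch of the deterministic regex layer and a priority-based merge step. This is not the authors' code: the patterns in `IOC_PATTERNS`, the `Span` type, and the "earlier layer wins on overlap" policy in `merge_spans` are all assumptions made for illustration, and the spaCy and BERT layers are represented only as lists of spans that such components might return.

```python
import re
from typing import Iterable, List, NamedTuple


class Span(NamedTuple):
    start: int   # character offset where the match begins
    end: int     # character offset one past the end of the match
    label: str   # IOC type, e.g. "ipv4" or "malware"
    text: str    # matched surface string
    source: str  # which layer produced the span


# Illustrative patterns only (assumption): the pipeline's real regexes
# are not given in the table.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IOC_PATTERNS = {
    "url": re.compile(r"\bhttps?://[^\s\"'<>]+", re.IGNORECASE),
    "ipv4": re.compile(rf"\b(?:{_OCTET}\.){{3}}{_OCTET}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "domain": re.compile(
        r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}\b",
        re.IGNORECASE,
    ),
}


def regex_layer(text: str) -> List[Span]:
    """Deterministic layer: collect every pattern match as a Span."""
    spans = []
    for label, pattern in IOC_PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append(Span(m.start(), m.end(), label, m.group(), "regex"))
    return spans


def merge_spans(*layers: Iterable[Span]) -> List[Span]:
    """Merge layers in priority order (assumed policy).

    Layers are processed in the order given. Within a layer, spans are
    visited by start offset, longest first. A span is kept only if it does
    not overlap any span that has already been accepted.
    """
    accepted: List[Span] = []
    for layer in layers:
        ordered = sorted(layer, key=lambda s: (s.start, -(s.end - s.start)))
        for span in ordered:
            if all(span.end <= a.start or span.start >= a.end for a in accepted):
                accepted.append(span)
    return sorted(accepted)


if __name__ == "__main__":
    report = (
        "Beacon to 10.0.0.5 and http://evil.example.com/payload dropped by "
        "Emotet; hash d41d8cd98f00b204e9800998ecf8427e"
    )
    # Stand-in for a contextual NER layer (hypothetical output).
    i = report.index("Emotet")
    model_spans = [Span(i, i + len("Emotet"), "malware", "Emotet", "model")]
    for span in merge_spans(regex_layer(report), model_spans):
        print(span.label, span.text, span.source)
```

Giving the regex layer priority reflects the table's description of it as the primary deterministic layer; a real pipeline might instead resolve overlaps by confidence score or by label type.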