2025 Jan 2;25(1):211. doi: 10.3390/s25010211

Table 12.

Comparison among RL, GAs, and BERT QA RL + RS in a penetration testing scenario aligned with NIST SP 800-115.

Criterion	RL	GAs	BERT QA RL + RS
Nature of Environment	Dynamic, sequential, with rewards tied to actions	Nonsequential, evaluating solution populations without temporal feedback	Dynamic and sequential; integrates RL rewards and BERT’s semantic context
Continuous Adaptation	Adjusts its policy as the environment evolves (new ports, vulnerabilities)	Difficult; changes require new populations and generations, with no guarantee of rapid adaptation	Iterative adjustment: RL adapts to novel findings and BERT recalibrates its responses, incorporating new $Q, C, A$
Contextual Information	Can leverage structured information (states, rewards) but has limited semantic comprehension	No semantic understanding; evaluates only solution fitness, without linguistic context	Incorporates BERT’s contextual comprehension, correlating vulnerability descriptions (CVE/CWE) with NIST methodology
Alignment with NIST SP 800-115	RL can implement the cycle (reconnaissance, identification, exploitation) by maximizing rewards at each phase	No natural integration with these phases; GAs optimize a fitness function and lack a sequential flow suited to the recommended stages	Aligns with the phases (planning, reconnaissance, vulnerability assessment, exploitation, reporting), leveraging RL and BERT’s semantics
Scalability	Scalable, though it may require more computation as complexity increases	Scalable in exploration, but lacks a mechanism to guide adaptive policy changes over time	Scalable; each RL insight is integrated by BERT, facilitating the reuse and expansion of the knowledge base
Final Outcome	An optimal (or near-optimal) policy guiding sequential pentesting actions	A set of candidate solutions without guaranteeing dynamic adaptation or contextual integration	A dynamic policy, informed by semantic context and aligned with NIST guidelines, optimizing tests and leveraging cumulative learning
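
The contrast in the "Nature of Environment" row can be illustrated with a minimal sketch. It is not the implementation evaluated in this work: the two-action toy environment, the reward values (+1.0 and -0.1), the hyperparameters, and all function names are assumptions chosen only for illustration. The phase names are taken from the table. Tabular Q-learning receives a reward after every action and updates its policy step by step, whereas a GA-style fitness function scores a complete candidate plan in one pass, with no per-step feedback.

```python
import random

# Phases listed in the table for the NIST SP 800-115-aligned workflow.
PHASES = ["planning", "reconnaissance", "vulnerability_assessment",
          "exploitation", "reporting"]
# Hypothetical toy actions: move to the next phase or repeat the current one.
ACTIONS = ["advance", "repeat"]


def step(state, action):
    """Toy transition (assumed): 'advance' earns +1.0, 'repeat' costs 0.1."""
    if action == "advance":
        return state + 1, 1.0
    return state, -0.1


def train_q(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in PHASES]
    terminal = len(PHASES) - 1
    for _ in range(episodes):
        state = 0
        for _ in range(50):  # cap on steps per episode
            if state == terminal:
                break
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))  # explore
            else:
                a = max(range(len(ACTIONS)), key=lambda i: q[state][i])  # exploit
            nxt, reward = step(state, ACTIONS[a])
            # Sequential feedback: the reward arrives after each action.
            target = reward + gamma * max(q[nxt])
            q[state][a] += alpha * (target - q[state][a])
            state = nxt
    return q


def greedy_policy(q):
    """Best-known action for each non-terminal phase."""
    return {
        PHASES[s]: ACTIONS[max(range(len(ACTIONS)), key=lambda i: q[s][i])]
        for s in range(len(PHASES) - 1)
    }


def ga_fitness(plan):
    """GA-style evaluation: score a whole plan at once, no temporal feedback."""
    return sum(1.0 if action == "advance" else -0.1 for action in plan)


if __name__ == "__main__":
    print(greedy_policy(train_q()))
    print(ga_fitness(["advance", "repeat", "advance"]))
```

In this toy setting the learned greedy policy selects `advance` in every non-terminal phase, while `ga_fitness` returns only a single aggregate score per candidate plan. That difference mirrors the "Continuous Adaptation" row: the Q-table can keep being updated as new findings arrive, whereas the GA would need a new generation of candidates.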