Input: Train dataset D_train, test dataset D_test, number of workers N
Output: Detection results of policy π_d on the test dataset D_test

01: Preprocess the train and test datasets
02: Initialize the datasets with the SABPIO algorithm: D'_train, D'_test
03: Initialize the global experience replay pools B_a (attacker) and B_d (defender)
04: Initialize π_d (defender policy) and π_a (attacker policy) randomly
05: Initialize shared parameters θ_d and θ_a for π_d and π_a, respectively
06: Initialize worker threads based on N
07: foreach worker in worker threads do
08:     for epoch = 1 to epochs do
09:         Initialize the state s_0 by random sampling from D'_train
10:         Choose a set of actions A_{a,t} based on π_a
11:         Initialize the local mini-batch buffer M by random sampling from D'_train, with all labels belonging to A_{a,t}
12:         while the episode is not done do
13:             Choose an action a_{d,t} based on π_d for each state s_t in M
14:             Obtain the rewards for the attacker and the defender: r_{a,t}, r_{d,t}
15:             Choose a new set of actions A_{a,t+1} based on π_a
16:             Update the local mini-batch buffer M by random sampling from D'_train, with all labels belonging to A_{a,t+1}
17:             Calculate the priority weight of each attacker experience and each defender experience
18:             Store the attacker experience in B_a
19:             Store the defender experience in B_d
20:             Update the worker attacker agent by sampling experiences from B_a according to their priority weights
21:             Update the worker defender agent by sampling experiences from B_d according to their priority weights
22:         end while
23:         Update the global attacker agent with the worker attacker agent
24:         Update the global defender agent with the worker defender agent
25:     end for
26: end foreach
27: Test the policy π_d on the test dataset D'_test
28: Return the detection results
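
Steps 03 and 17–21 revolve around priority-weighted experience replay: each experience is stored together with a priority weight, and the worker updates sample experiences in proportion to those weights. The sketch below is a minimal illustration of that mechanism, not the paper's implementation; the class name PrioritizedReplayPool is invented here, and the priority is assumed to be the absolute TD error plus a small constant, as in standard prioritized experience replay, since the listing above does not give the exact weight formula.

```python
# Minimal sketch of a priority-weighted replay pool (steps 03, 17-21).
# Priority = |TD error| + eps is an assumption; the paper's formula is not shown above.
import numpy as np

class PrioritizedReplayPool:
    def __init__(self, capacity=100_000, eps=1e-3):
        self.capacity = capacity
        self.eps = eps              # keeps every experience sampleable
        self.pool = []              # list of (experience, priority) pairs

    def store(self, experience, td_error):
        """Steps 17-19: store one experience together with its priority weight."""
        if len(self.pool) >= self.capacity:
            self.pool.pop(0)        # drop the oldest experience when full
        self.pool.append((experience, abs(td_error) + self.eps))

    def sample(self, batch_size):
        """Steps 20-21: sample experiences with probability proportional to weight."""
        snapshot = list(self.pool)  # copy so concurrent stores cannot shift indices
        weights = np.asarray([w for _, w in snapshot], dtype=np.float64)
        probs = weights / weights.sum()
        idx = np.random.choice(len(snapshot), size=min(batch_size, len(snapshot)),
                               replace=False, p=probs)
        return [snapshot[i][0] for i in idx]
```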
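Steps 06–26 describe an asynchronous multi-worker loop: each worker keeps local copies of the attacker and defender agents, lets the attacker pick which traffic classes appear in the mini-batch, rewards the defender for classifying those samples, and periodically merges its local parameters into the global agents. The skeleton below sketches that control flow only, reusing the PrioritizedReplayPool sketch above; Agent, choose_attack_set, make_minibatch, the ±1 reward, and the parameter-averaging merge are illustrative placeholders, not the networks, rewards, or update rules of the actual method.

```python
# Structural skeleton of the per-worker adversarial training loop (steps 06-26).
# All agent internals and rewards are toy stand-ins for illustration only.
import random
import threading

class Agent:
    """Placeholder agent: a flat parameter list plus a random policy."""
    def __init__(self, n_params=8, n_actions=4):
        self.params = [random.uniform(-1, 1) for _ in range(n_params)]
        self.n_actions = n_actions

    def clone(self):
        other = Agent(len(self.params), self.n_actions)
        other.params = list(self.params)
        return other

    def act(self, state):
        return random.randrange(self.n_actions)          # stand-in for sampling from the policy

    def local_update(self, batch, lr=0.01):
        # stand-in for a gradient step on the sampled experiences
        self.params = [p + lr * random.uniform(-1, 1) for p in self.params]

def choose_attack_set(attacker, n_labels=4, k=2):
    """Steps 10/15: the attacker picks the set of classes shown to the defender."""
    return set(random.sample(range(n_labels), k))

def make_minibatch(d_train, attack_set, size=32):
    """Steps 11/16: sample records whose labels belong to the chosen action set."""
    candidates = [x for x in d_train if x["label"] in attack_set]
    return random.sample(candidates, min(size, len(candidates)))

def sync_to_global(global_agent, worker_agent, lock):
    """Steps 23/24: merge the worker's parameters into the global agent."""
    with lock:
        global_agent.params = [0.5 * g + 0.5 * w
                               for g, w in zip(global_agent.params, worker_agent.params)]

def worker_loop(global_attacker, global_defender, pool_a, pool_d,
                d_train, epochs, steps, lock):
    attacker, defender = global_attacker.clone(), global_defender.clone()    # step 07
    for _ in range(epochs):                                                  # step 08
        attack_set = choose_attack_set(attacker)                             # step 10
        batch = make_minibatch(d_train, attack_set)                          # steps 09/11
        for _ in range(steps):                                               # steps 12-22
            for state in batch:                                              # step 13
                a_d = defender.act(state)
                r_d = 1.0 if a_d == state["label"] else -1.0                 # step 14 (toy reward)
                r_a = -r_d
                pool_a.store((state, tuple(attack_set), r_a), td_error=r_a)  # steps 17-18
                pool_d.store((state, a_d, r_d), td_error=r_d)                # steps 17, 19
            attack_set = choose_attack_set(attacker)                         # step 15
            batch = make_minibatch(d_train, attack_set)                      # step 16
            attacker.local_update(pool_a.sample(32))                         # step 20
            defender.local_update(pool_d.sample(32))                         # step 21
        sync_to_global(global_attacker, attacker, lock)                      # step 23
        sync_to_global(global_defender, defender, lock)                      # step 24

if __name__ == "__main__":
    # Toy dataset and N = 4 worker threads (steps 06-07, 26).
    d_train = [{"features": [random.random() for _ in range(8)],
                "label": random.randrange(4)} for _ in range(256)]
    global_attacker, global_defender = Agent(), Agent()
    pool_a, pool_d = PrioritizedReplayPool(), PrioritizedReplayPool()
    lock = threading.Lock()
    workers = [threading.Thread(target=worker_loop,
                                args=(global_attacker, global_defender, pool_a, pool_d,
                                      d_train, 3, 20, lock)) for _ in range(4)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
```

Keeping a single lock around the global parameter merge mirrors the arrangement in the listing, where workers run their epochs independently and only synchronize with the global agents at steps 23–24.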