Input: Train dataset D_train, test dataset D_test, number of workers N
Output: Detection results of policy π_d on the test dataset D_test

01: Preprocess the train and test datasets
02: Initialize the datasets with the SABPIO algorithm: D'_train, D'_test
03: Initialize the global experience replay pools B_a (attacker) and B_d (defender)
04: Initialize π_d (defender policy) and π_a (attacker policy) randomly
05: Initialize shared parameters θ_d and θ_a for π_d and π_a, respectively
06: Initialize worker threads based on N
07: foreach worker in worker threads do
08:     for epoch = 1 to epochs do
09:         Initialize the state s_0 by random sampling from D'_train
10:         Choose a set of actions A_{a,t} based on π_a
11:         Initialize the local mini-batch buffer M by random sampling from D'_train, with all labels belonging to A_{a,t}
12:         while the episode is not done do
13:             Choose an action a_{d,t} based on π_d for each state s_t in M
14:             Obtain the rewards for the attacker and the defender: r_{a,t}, r_{d,t}
15:             Choose a new set of actions A_{a,t+1} based on π_a
16:             Update the local mini-batch buffer M by random sampling from D'_train, with all labels belonging to A_{a,t+1}
17:             Calculate the priority weight of each attacker experience and each defender experience
18:             Store the attacker experience in B_a
19:             Store the defender experience in B_d
20:             Update the worker attacker agent by sampling experiences from B_a according to their priority weights
21:             Update the worker defender agent by sampling experiences from B_d according to their priority weights
22:         end while
23:         Update the global attacker agent with the worker attacker agent
24:         Update the global defender agent with the worker defender agent
25:     end for
26: end foreach
27: Test the policy π_d on the test dataset D'_test
28: Return the detection results
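
Steps 03 and 17–21 revolve around priority-weighted experience replay: each experience is stored together with a priority weight, and the worker updates sample experiences in proportion to those weights. The sketch below is a minimal illustration of that mechanism, not the paper's implementation; the class name PrioritizedReplayPool is invented here, and the priority is assumed to be the absolute TD error plus a small constant, as in standard prioritized experience replay, since the listing above does not give the exact weight formula.

```python
# Minimal sketch of a priority-weighted replay pool (steps 03, 17-21).
# Priority = |TD error| + eps is an assumption; the paper's formula is not shown above.
import numpy as np

class PrioritizedReplayPool:
    def __init__(self, capacity=100_000, eps=1e-3):
        self.capacity = capacity
        self.eps = eps              # keeps every experience sampleable
        self.pool = []              # list of (experience, priority) pairs

    def store(self, experience, td_error):
        """Steps 17-19: store one experience together with its priority weight."""
        if len(self.pool) >= self.capacity:
            self.pool.pop(0)        # drop the oldest experience when full
        self.pool.append((experience, abs(td_error) + self.eps))

    def sample(self, batch_size):
        """Steps 20-21: sample experiences with probability proportional to weight."""
        snapshot = list(self.pool)  # copy so concurrent stores cannot shift indices
        weights = np.asarray([w for _, w in snapshot], dtype=np.float64)
        probs = weights / weights.sum()
        idx = np.random.choice(len(snapshot), size=min(batch_size, len(snapshot)),
                               replace=False, p=probs)
        return [snapshot[i][0] for i in idx]
```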
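Steps 06–26 describe an asynchronous multi-worker loop: each worker keeps local copies of the attacker and defender agents, lets the attacker pick which traffic classes appear in the mini-batch, rewards the defender for classifying those samples, and periodically merges its local parameters into the global agents. The skeleton below sketches that control flow only, reusing the PrioritizedReplayPool sketch above; Agent, choose_attack_set, make_minibatch, the ±1 reward, and the parameter-averaging merge are illustrative placeholders, not the networks, rewards, or update rules of the actual method.

```python
# Structural skeleton of the per-worker adversarial training loop (steps 06-26).
# All agent internals and rewards are toy stand-ins for illustration only.
import random
import threading

class Agent:
    """Placeholder agent: a flat parameter list plus a random policy."""
    def __init__(self, n_params=8, n_actions=4):
        self.params = [random.uniform(-1, 1) for _ in range(n_params)]
        self.n_actions = n_actions

    def clone(self):
        other = Agent(len(self.params), self.n_actions)
        other.params = list(self.params)
        return other

    def act(self, state):
        return random.randrange(self.n_actions)          # stand-in for sampling from the policy

    def local_update(self, batch, lr=0.01):
        # stand-in for a gradient step on the sampled experiences
        self.params = [p + lr * random.uniform(-1, 1) for p in self.params]

def choose_attack_set(attacker, n_labels=4, k=2):
    """Steps 10/15: the attacker picks the set of classes shown to the defender."""
    return set(random.sample(range(n_labels), k))

def make_minibatch(d_train, attack_set, size=32):
    """Steps 11/16: sample records whose labels belong to the chosen action set."""
    candidates = [x for x in d_train if x["label"] in attack_set]
    return random.sample(candidates, min(size, len(candidates)))

def sync_to_global(global_agent, worker_agent, lock):
    """Steps 23/24: merge the worker's parameters into the global agent."""
    with lock:
        global_agent.params = [0.5 * g + 0.5 * w
                               for g, w in zip(global_agent.params, worker_agent.params)]

def worker_loop(global_attacker, global_defender, pool_a, pool_d,
                d_train, epochs, steps, lock):
    attacker, defender = global_attacker.clone(), global_defender.clone()    # step 07
    for _ in range(epochs):                                                  # step 08
        attack_set = choose_attack_set(attacker)                             # step 10
        batch = make_minibatch(d_train, attack_set)                          # steps 09/11
        for _ in range(steps):                                               # steps 12-22
            for state in batch:                                              # step 13
                a_d = defender.act(state)
                r_d = 1.0 if a_d == state["label"] else -1.0                 # step 14 (toy reward)
                r_a = -r_d
                pool_a.store((state, tuple(attack_set), r_a), td_error=r_a)  # steps 17-18
                pool_d.store((state, a_d, r_d), td_error=r_d)                # steps 17, 19
            attack_set = choose_attack_set(attacker)                         # step 15
            batch = make_minibatch(d_train, attack_set)                      # step 16
            attacker.local_update(pool_a.sample(32))                         # step 20
            defender.local_update(pool_d.sample(32))                         # step 21
        sync_to_global(global_attacker, attacker, lock)                      # step 23
        sync_to_global(global_defender, defender, lock)                      # step 24

if __name__ == "__main__":
    # Toy dataset and N = 4 worker threads (steps 06-07, 26).
    d_train = [{"features": [random.random() for _ in range(8)],
                "label": random.randrange(4)} for _ in range(256)]
    global_attacker, global_defender = Agent(), Agent()
    pool_a, pool_d = PrioritizedReplayPool(), PrioritizedReplayPool()
    lock = threading.Lock()
    workers = [threading.Thread(target=worker_loop,
                                args=(global_attacker, global_defender, pool_a, pool_d,
                                      d_train, 3, 20, lock)) for _ in range(4)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
```

Keeping a single lock around the global parameter merge mirrors the arrangement in the listing, where workers run their epochs independently and only synchronize with the global agents at steps 23–24.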