Algorithm 2 Adversarial attack trick and adversarial learning MADDPG (A2-MADDPG) algorithm.
1: for N agents, randomly initialize their critic networks Q_i and actor networks μ_i
2: synchronize target networks Q'_i and μ'_i with Q_i and μ_i
3: initialize hyperparameters: experience buffer D, mini-batch size m, max episode M, max step T, actor learning rate l_a, critic learning rate l_c, discount factor γ, soft update rate τ
4: for episode = 1, M do
5:   reset the environment, and receive the initial state x
6:   initialize the exploration noise of the action
7:   for t = 1, T do
8:     …
9:     for each agent i, select action a_i = μ_i(o_i) plus exploration noise
10:    execute actions a = (a_1, …, a_N), receive rewards r = (r_1, …, r_N), and next state x'
11:    store sample (x, a, r, x') in D
12:    for agent i = 1, N do
13:      randomly extract m samples from D
14:      update the critic network by Equation (27)
15:      update the actor network by the sampled policy gradient
16:    end for
17:    update the target networks by θ'_i ← τθ_i + (1 − τ)θ'_i
18:  end for
19: end for
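The control flow above can be sketched in code. This is a minimal structural sketch only, not the authors' implementation: the environment transition, reward, and the gradient step in lines 14–15 are hypothetical placeholders (the real updates use Equation (27) and the sampled policy gradient, which depend on the network definitions), and the adversarial-attack step at line 8 is not reproduced because its content is not specified here. The soft target update of line 17 is shown exactly.

```python
import random
from collections import deque

import numpy as np

# Hypothetical sizes for the sketch (not from the paper).
N_AGENTS, OBS_DIM, ACT_DIM = 2, 4, 2
M_EPISODES, T_STEPS, BATCH = 3, 5, 8
TAU = 0.01  # soft update rate tau

rng = np.random.default_rng(0)

# Lines 1-2: initialize online parameters, synchronize targets with them.
params = [rng.normal(size=(OBS_DIM, ACT_DIM)) for _ in range(N_AGENTS)]
target_params = [p.copy() for p in params]

buffer = deque(maxlen=10_000)  # line 3: experience buffer D


def act(p, obs, noise_scale):
    # Line 9: deterministic policy output plus exploration noise.
    return obs @ p + noise_scale * rng.normal(size=ACT_DIM)


def soft_update(target, online, tau):
    # Line 17: theta' <- tau * theta + (1 - tau) * theta' (in place).
    target *= (1.0 - tau)
    target += tau * online


for episode in range(M_EPISODES):                 # line 4
    x = rng.normal(size=(N_AGENTS, OBS_DIM))      # line 5: initial state
    noise_scale = 0.1                             # line 6
    for t in range(T_STEPS):                      # line 7
        a = np.stack([act(params[i], x[i], noise_scale)
                      for i in range(N_AGENTS)])  # line 9
        # Line 10: dummy environment step (placeholder transition).
        r = -np.sum(a**2, axis=1)
        x_next = rng.normal(size=(N_AGENTS, OBS_DIM))
        buffer.append((x, a, r, x_next))          # line 11
        x = x_next
        if len(buffer) >= BATCH:
            batch = random.sample(buffer, BATCH)  # line 13
            for i in range(N_AGENTS):             # line 12
                # Lines 14-15: placeholder gradient step, standing in for
                # the critic update of Equation (27) and the actor update.
                grad = np.mean([b[0][i][:, None] * b[1][i][None, :]
                                for b in batch], axis=0)
                params[i] -= 1e-3 * grad
            for i in range(N_AGENTS):             # line 17
                soft_update(target_params[i], params[i], TAU)
```

Keeping per-agent critics and actors in separate parameter lists mirrors the per-agent inner loop (lines 12–16), while the target copies are touched only through the soft update so they change slowly at rate τ.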