Algorithm 1: The proposed improved soft actor-critic algorithm.

Initialization: Randomly initialize the parameters of the policy network and the two Q networks. Set up the experience replay buffer D with a size of 100,000.

Input: The current communication parameters of the communicating parties and the current jamming action of the jammer.
1: for episode i = 1, 2, …, J do
2:   for step j = 1, 2, …, N do
3:     Inputting the current state s_j to the policy network and sampling the output proto-action a'_j;
4:     Inputting the proto-action a'_j to the improved Wolpertinger architecture to obtain the actually executed action a_j;
5:     Executing action a_j;
6:     Obtaining the next state s_{j+1} and the feedback, and calculating the actual reward r_j;
7:     Storing (s_j, a_j, r_j, s_{j+1}) in the experience replay buffer D;
8:     Sampling a minibatch from the experience replay buffer D for training;
9:     Updating the network parameters A and B of Q1 and Q2;
10:    Updating the parameters of the policy network;
11:    Updating the parameters of the target Q1 and target Q2 networks;
12:    Setting s_j ← s_{j+1};
13:  end for
14: end for
|
Output: The jamming action for the communicating parties at the next moment.
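To show how the steps of Algorithm 1 fit together, the following is a minimal sketch of the training loop in Python/PyTorch. It assumes standard SAC machinery (a Gaussian policy, twin Q networks with soft-updated targets, an entropy-regularised objective) and the usual Wolpertinger refinement, which maps the proto-action to its nearest executable discrete actions and picks the one with the highest Q-value. All names, network sizes, hyperparameters, and the toy environment are hypothetical placeholders; the paper's specific improvements to SAC and to the Wolpertinger architecture are not reproduced here. The comments map each block back to the numbered lines of Algorithm 1.

```python
# Hypothetical placeholder setting: small state/action dimensions, random toy environment.
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM, N_DISCRETE, K_NEAREST = 4, 2, 16, 5
discrete_actions = torch.rand(N_DISCRETE, ACTION_DIM)   # table of executable discrete actions

class Policy(nn.Module):
    """Gaussian policy network that outputs a continuous proto-action distribution."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU())
        self.mu = nn.Linear(64, ACTION_DIM)
        self.log_std = nn.Linear(64, ACTION_DIM)
    def forward(self, s):
        h = self.body(s)
        return torch.distributions.Normal(self.mu(h), self.log_std(h).clamp(-5, 2).exp())

class QNet(nn.Module):
    """State-action value network."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def wolpertinger(proto, q_net, s):
    """Standard Wolpertinger refinement: evaluate the K discrete actions nearest to
    the proto-action and return the one with the highest Q-value (line 4)."""
    dists = torch.norm(discrete_actions - proto, dim=1)
    cand = discrete_actions[dists.topk(K_NEAREST, largest=False).indices]
    q = q_net(s.expand(K_NEAREST, -1), cand).squeeze(-1)
    return cand[q.argmax()]

def env_step(state, action):
    """Placeholder environment: random next state and reward stand in for the real
    communication feedback (lines 5-6)."""
    return torch.rand(STATE_DIM), torch.rand(1).item()

policy, q1, q2, q1_t, q2_t = Policy(), QNet(), QNet(), QNet(), QNet()
q1_t.load_state_dict(q1.state_dict()); q2_t.load_state_dict(q2.state_dict())
opt_pi = torch.optim.Adam(policy.parameters(), lr=3e-4)
opt_q = torch.optim.Adam(list(q1.parameters()) + list(q2.parameters()), lr=3e-4)
replay = deque(maxlen=100_000)                 # experience replay buffer D
gamma, tau, alpha, batch_size = 0.99, 0.005, 0.2, 64

for episode in range(10):                      # line 1: episodes
    s = torch.rand(STATE_DIM)
    for step in range(50):                     # line 2: steps within an episode
        with torch.no_grad():
            proto = policy(s).sample()         # line 3: sample proto-action a'_j
            a = wolpertinger(proto, q1, s)     # line 4: refine to executable action a_j
        s_next, r = env_step(s, a)             # lines 5-6: act, observe, compute reward
        replay.append((s, a, r, s_next))       # line 7: store transition in D
        if len(replay) >= batch_size:
            batch = random.sample(replay, batch_size)          # line 8: minibatch
            S, A, R, S2 = (torch.stack(x) if isinstance(x[0], torch.Tensor)
                           else torch.tensor(x).unsqueeze(1) for x in zip(*batch))
            with torch.no_grad():              # line 9: entropy-regularised soft-Q targets
                d2 = policy(S2); a2 = d2.sample()
                logp2 = d2.log_prob(a2).sum(-1, keepdim=True)
                y = R + gamma * (torch.min(q1_t(S2, a2), q2_t(S2, a2)) - alpha * logp2)
            q_loss = F.mse_loss(q1(S, A), y) + F.mse_loss(q2(S, A), y)
            opt_q.zero_grad(); q_loss.backward(); opt_q.step()
            d = policy(S); a_pi = d.rsample()  # line 10: policy update via reparameterisation
            logp = d.log_prob(a_pi).sum(-1, keepdim=True)
            pi_loss = (alpha * logp - q1(S, a_pi)).mean()
            opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()
            for tgt, src in ((q1_t, q1), (q2_t, q2)):          # line 11: soft target update
                for pt, ps in zip(tgt.parameters(), src.parameters()):
                    pt.data.mul_(1 - tau).add_(tau * ps.data)
        s = s_next                             # line 12: advance the state
```

In a real anti-jamming setting, env_step would apply the chosen communication parameters, observe the jammer's response, and return the resulting reward feedback described in the paper.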