Algorithm 1: The proposed improved soft actor-critic algorithm.
Initialization: Randomly initialize the parameters of the policy network and the two
Q networks. Set up the experience replay buffer D with a size of 100,000.
Input: The current communication parameters of the communication parties and the
current jamming action of the jammer
1: for episode i = 1, 2, …, J do
2:     for step j = 1, 2, …, N do
3:          Input the current state, st, to the policy network and sample the proto-action
             from its output;
4:          Input the proto-action to the improved Wolpertinger architecture to obtain
             the actual executed action, at (see the sketch after the algorithm);
5:          Executing action at;
6:          Obtaining the next state, st+1, and the feedback, and calculating the actual
             reward, r;
7:          Storing (st, at, st+1, r) in the experience replay buffer D;
8:          Sampling a mini-batch of size NB from D for training (see the replay sketch
             after the algorithm);
9:          Updating the parameters of the two Q networks, Q1 and Q2;
10:        Updating the parameters of the policy network;
11:        Updating the parameters of the target Q1 and target Q2 networks;
12:        Setting st=st+1;
13:    end for
14: end for
Output: Jamming action for the communication parties at the next moment
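A minimal sketch of the Wolpertinger refinement in line 4, assuming a small discrete set of (channel, power) actions and a toy Q function standing in for the learned critic; the action set, state shape, and neighbourhood size k = 5 are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

# Illustrative discrete action set: each row is a hypothetical
# (channel index, transmit power) pair; not the paper's actual action space.
ACTIONS = np.array([[c, p] for c in range(8) for p in (0.1, 0.5, 1.0)])

def q_value(state, action):
    # Toy stand-in for the learned Q network.
    return -float(np.sum((action - state[:2]) ** 2))

def wolpertinger_select(state, proto_action, k=5):
    """Map a continuous proto-action to an executable discrete action:
    take the k nearest discrete actions, score each with Q, execute the best."""
    dists = np.linalg.norm(ACTIONS - proto_action, axis=1)
    candidates = ACTIONS[np.argsort(dists)[:k]]
    scores = [q_value(state, a) for a in candidates]
    return candidates[int(np.argmax(scores))]

state = np.array([3.0, 0.4, 0.0])   # toy state st
proto = np.array([2.7, 0.46])       # continuous policy output (proto-action)
print(wolpertinger_select(state, proto))
```

In the full algorithm the proto-action comes from the stochastic SAC policy and Q is the learned critic; the "improved" elements of the architecture described in the paper are not reproduced here.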
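Lines 7-8 follow the standard experience-replay pattern: each transition is appended to a bounded buffer and uniform mini-batches are drawn for the updates in lines 9-11. A minimal sketch, with the buffer size matching the initialization above; the batch size NB = 64 is an assumed placeholder, as the paper's value is not shown in this excerpt:

```python
import random
from collections import deque

D = deque(maxlen=100_000)  # experience replay buffer, size as in the initialization
NB = 64                    # mini-batch size; placeholder value

def store(s, a, s_next, r):
    # Line 7: store the transition (st, at, st+1, r).
    D.append((s, a, s_next, r))

def sample_minibatch():
    # Line 8: draw a uniform mini-batch of NB transitions for training.
    return random.sample(list(D), min(NB, len(D)))

# Toy usage with placeholder transitions.
for t in range(200):
    store(t, t % 3, t + 1, float(t % 2))
print(len(sample_minibatch()))  # -> 64
```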