Algorithm 1 Discretized soft-actor-critic algorithm.
1: Initialize the experience buffer D
2: Initialize the weight with 1
3: Initialize the actor network with random parameters
4: Initialize the main critic networks with random parameters
5: Initialize the target critic networks with the same parameters as the main critic networks
6: for each training episode do
7:   Observe the initial state
8:   for each step do
9:     Generate the action
10:    Execute the action
11:    Observe the next state and the reward
12:    Store the experience in the experience buffer D
13:    Sample a mini-batch of experiences from the buffer D
14:    Calculate the target state value based on Equation (9)
15:    Update the main critic networks based on the gradient in Equation (10)
16:    Update the actor network based on the gradient in Equation (12)
17:    Update the weight based on the gradient in Equation (13)
18:    Every B steps, apply a soft update to the target critic networks based on Equation (14)
19:  end for
20: end for
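The control flow of Algorithm 1 can be sketched as a minimal Python skeleton. This is only a structural illustration: the environment, the action generator, and the parameter vectors below are toy placeholders, and the actual critic, actor, and weight updates of Equations (9)–(13) are not reproduced here (they are marked as stubs). Only the buffer handling and the periodic soft target update (Equation (14), assumed here to be a standard Polyak average with rate tau) are implemented concretely.

```python
import random
from collections import deque

def generate_action(state):
    # Step 9: placeholder policy; a real actor network would act here.
    return random.choice([0, 1])

def env_step(state, action):
    # Steps 10-11: toy environment returning (next_state, reward).
    return state + action, float(action)

def soft_update(target, main, tau=0.005):
    # Step 18 / Equation (14): assumed Polyak average of main params
    # into the target params at rate tau.
    return [tau * m + (1.0 - tau) * t for m, t in zip(main, target)]

class ReplayBuffer:
    # Steps 1, 12, 13: fixed-capacity experience buffer D.
    def __init__(self, capacity=10_000):
        self.data = deque(maxlen=capacity)
    def store(self, experience):
        self.data.append(experience)
    def sample(self, batch_size):
        return random.sample(self.data, min(batch_size, len(self.data)))

buffer = ReplayBuffer()
main_critic = [0.0, 0.0]            # toy parameter vector (step 4)
target_critic = list(main_critic)   # step 5: copy of main critic params
B = 4                               # target-update period

step_count = 0
for episode in range(3):            # step 6
    state = 0                       # step 7
    for _ in range(10):             # step 8
        action = generate_action(state)
        next_state, reward = env_step(state, action)
        buffer.store((state, action, reward, next_state))  # step 12
        batch = buffer.sample(32)   # step 13
        # Steps 14-17: the target-value computation and the critic,
        # actor, and weight gradient updates (Equations (9)-(13))
        # would go here; left as a stub in this sketch.
        step_count += 1
        if step_count % B == 0:     # step 18
            target_critic = soft_update(target_critic, main_critic)
        state = next_state
```

Keeping the target critics a slowly-moving copy of the main critics (rather than updating them every step) is what stabilizes the bootstrapped target in step 14.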