Algorithm 1: MEP-DDPG Algorithm
Randomly initialize the critic network $Q(s, a \mid \theta^Q)$ and the actor $\mu(s \mid \theta^\mu)$ with weights $\theta^Q$ and $\theta^\mu$
Initialize the target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$
Initialize the experience pools
for episode = 1, $M$ do
    Initialize a random process $\mathcal{N}$ for action exploration
    Receive the initial observation state $s_1$
    for $t = 1, T$ do
        According to the exploration probability, select action $a_t = \mu(s_t \mid \theta^\mu) + \mathcal{N}_t$ based on the current policy and exploration noise, or select action $a_t$ by MPC-SA
        Execute action $a_t$ and observe the reward $r_t$ and the new state $s_{t+1}$
        Store the experience $(s_t, a_t, r_t, s_{t+1})$ in the specific experience pool corresponding to the source of $a_t$
        Following the sampling policy, sample a random minibatch of $N$ experiences $(s_i, a_i, r_i, s_{i+1})$ from the experience pools
        Set $y_i = r_i + \gamma Q'\big(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big)$
        Update the critic by minimizing the loss: $L = \frac{1}{N} \sum_i \big(y_i - Q(s_i, a_i \mid \theta^Q)\big)^2$
        Update the actor policy using the sampled policy gradient:
            $\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a \mid \theta^Q)\big|_{s = s_i, a = \mu(s_i)} \, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big|_{s_i}$
        Update the target networks:
            $\theta^{Q'} \leftarrow \tau \theta^Q + (1 - \tau)\theta^{Q'}$
            $\theta^{\mu'} \leftarrow \tau \theta^\mu + (1 - \tau)\theta^{\mu'}$
    end for
end for
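As a concrete illustration, the following is a minimal PyTorch sketch of one step of the inner loop of Algorithm 1. The network sizes, the hyperparameter values (GAMMA, TAU, EPSILON, BATCH), the mpc_sa_action planner stub, the env_step interface, and the half-and-half minibatch split between the two experience pools are assumptions made for illustration only, not the settings used in this work.

import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2                       # hypothetical problem dimensions
GAMMA, TAU, EPSILON, BATCH = 0.99, 0.005, 0.8, 64  # hypothetical hyperparameters

def mlp(in_dim, out_dim, out_act=nn.Identity):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                         nn.Linear(256, out_dim), out_act())

actor  = mlp(STATE_DIM, ACTION_DIM, nn.Tanh)       # mu(s | theta_mu)
critic = mlp(STATE_DIM + ACTION_DIM, 1)            # Q(s, a | theta_Q)
actor_t  = mlp(STATE_DIM, ACTION_DIM, nn.Tanh)     # target networks
critic_t = mlp(STATE_DIM + ACTION_DIM, 1)
actor_t.load_state_dict(actor.state_dict())        # theta_mu' <- theta_mu
critic_t.load_state_dict(critic.state_dict())      # theta_Q'  <- theta_Q
actor_opt  = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# Two experience pools, one per action source (policy + noise vs. MPC-SA).
pool_ddpg, pool_mpc = deque(maxlen=100_000), deque(maxlen=100_000)

def mpc_sa_action(state):
    # Placeholder for the MPC-SA planner; a real implementation would optimize
    # a model-predictive objective with simulated annealing.
    return torch.rand(ACTION_DIM) * 2 - 1

def soft_update(target, source):
    # theta' <- tau * theta + (1 - tau) * theta'
    for p_t, p in zip(target.parameters(), source.parameters()):
        p_t.data.mul_(1 - TAU).add_(TAU * p.data)

def sample_mixed(batch):
    # Hypothetical sampling policy: draw half of the minibatch from each pool.
    half = batch // 2
    exp = random.sample(list(pool_ddpg), half) + random.sample(list(pool_mpc), batch - half)
    s, a, r, s2 = map(torch.stack, zip(*exp))
    return s, a, r.unsqueeze(1), s2

def train_step(env_step, state):
    # Select the action from the current policy plus exploration noise, or from MPC-SA.
    if random.random() < EPSILON:
        with torch.no_grad():
            action = actor(state) + 0.1 * torch.randn(ACTION_DIM)
        pool = pool_ddpg
    else:
        action = mpc_sa_action(state)
        pool = pool_mpc
    reward, next_state = env_step(state, action)   # execute a_t, observe r_t and s_{t+1}
    pool.append((state, action, torch.tensor(reward, dtype=torch.float32), next_state))

    if min(len(pool_ddpg), len(pool_mpc)) < BATCH // 2:
        return next_state                          # wait until both pools can be sampled
    s, a, r, s2 = sample_mixed(BATCH)

    # Critic update: minimize (y_i - Q(s_i, a_i))^2 with y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})).
    with torch.no_grad():
        y = r + GAMMA * critic_t(torch.cat([s2, actor_t(s2)], dim=1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: ascend the sampled policy gradient (descend its negation).
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft-update the target networks.
    soft_update(critic_t, critic)
    soft_update(actor_t, actor)
    return next_state

# Hypothetical usage with a dummy environment stub.
def env_step(state, action):
    return float(-action.norm()), torch.randn(STATE_DIM)

state = torch.randn(STATE_DIM)
for _ in range(500):
    state = train_step(env_step, state)

Keeping the policy-generated and MPC-SA-generated transitions in separate pools lets the sampling policy control how much planner experience enters each minibatch; the even split used in this sketch is only one possible choice.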