| Algorithm 1: MEP–DDPG Algorithm |
| --- |
| Randomly initialize the critic network $Q(s, a \mid \theta^{Q})$ and the actor $\mu(s \mid \theta^{\mu})$ with weights $\theta^{Q}$ and $\theta^{\mu}$ |
| Initialize the target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^{Q}$, $\theta^{\mu'} \leftarrow \theta^{\mu}$ |
| Initialize the experience pools |
| for episode $= 1, M$ do |
| &emsp;Initialize a random process $\mathcal{N}$ for action exploration |
| &emsp;Receive the initial observation state $s_1$ |
| &emsp;for $t = 1, T$ do |
| &emsp;&emsp;According to a preset probability, select the action $a_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t$ from the current policy and exploration noise, or select the action $a_t$ by MPC-SA |
| &emsp;&emsp;Execute the action $a_t$, then observe the reward $r_t$ and the new state $s_{t+1}$ |
| &emsp;&emsp;Store the experience $(s_t, a_t, r_t, s_{t+1})$ in the specific pool corresponding to the source of $a_t$ |
| &emsp;&emsp;Following the sampling policy, sample a random minibatch of $N$ experiences $(s_i, a_i, r_i, s_{i+1})$ from the experience pools |
| &emsp;&emsp;Set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$ |
| &emsp;&emsp;Update the critic by minimizing the loss $L = \frac{1}{N} \sum_i \left( y_i - Q(s_i, a_i \mid \theta^{Q}) \right)^2$ |
| &emsp;&emsp;Update the actor policy using the sampled policy gradient $\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_i \nabla_{a} Q(s, a \mid \theta^{Q}) \big\vert_{s = s_i,\, a = \mu(s_i)} \, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \big\vert_{s_i}$ |
| &emsp;&emsp;Update the target networks: $\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau) \theta^{Q'}$ and $\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau) \theta^{\mu'}$ |
| &emsp;end for |
| end for |
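The two steps that distinguish MEP–DDPG from standard DDPG are the probabilistic choice between the actor and the MPC-SA planner, and the routing of each experience to a pool that matches its source. The Python sketch below illustrates this bookkeeping under stated assumptions: the two-pool layout, the `expert_fraction` sampling ratio, and the `mpc_sa_action` planner hook are illustrative placeholders rather than the exact implementation used in the paper.

```python
import random
from collections import deque


def select_action(actor, state, noise, mpc_sa_action, policy_prob=0.8):
    """With probability policy_prob use the actor plus exploration noise,
    otherwise fall back to the MPC-SA planner (assumed interface)."""
    if random.random() < policy_prob:
        return actor(state) + noise.sample(), "policy"
    return mpc_sa_action(state), "mpc_sa"


class MixedExperiencePool:
    """Two replay pools, one per action source, with a ratio-based sampling policy (illustrative)."""

    def __init__(self, capacity=100_000):
        self.policy_pool = deque(maxlen=capacity)   # experiences generated by the actor network
        self.expert_pool = deque(maxlen=capacity)   # experiences generated by the MPC-SA planner

    def store(self, transition, source):
        # Route the transition (s, a, r, s_next) to the pool matching the source of the action.
        pool = self.expert_pool if source == "mpc_sa" else self.policy_pool
        pool.append(transition)

    def sample(self, batch_size, expert_fraction=0.25):
        # Sampling policy (assumed): draw a fixed fraction of the minibatch from the expert pool,
        # and fill the remainder from the policy pool.
        n_expert = min(int(batch_size * expert_fraction), len(self.expert_pool))
        n_policy = min(batch_size - n_expert, len(self.policy_pool))
        return random.sample(self.expert_pool, n_expert) + random.sample(self.policy_pool, n_policy)
```

Keeping the pools separate makes the sampling policy easy to adjust later, for example by annealing `expert_fraction` toward zero as the learned policy improves.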
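The inner-loop updates (target value $y_i$, critic loss, sampled policy gradient, and soft target update) are the standard DDPG equations. The following is a minimal PyTorch sketch of one such update, assuming `actor`, `critic`, and their target copies are `torch.nn.Module`s operating on batched tensors; the hyperparameters $\gamma$ and $\tau$ are placeholders.

```python
import torch
import torch.nn.functional as F


def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One critic/actor update plus soft target update, as in the inner loop of Algorithm 1."""
    s, a, r, s_next = batch  # tensors: states, actions, rewards, next states

    # Target value: y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))

    # Critic update: minimize L = (1/N) * sum_i (y_i - Q(s_i, a_i))^2
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: follow the sampled policy gradient by minimizing -Q(s, mu(s))
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft target updates: theta' <- tau * theta + (1 - tau) * theta'
    for net, target in ((critic, target_critic), (actor, target_actor)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```

Minimizing $-\frac{1}{N} \sum_i Q(s_i, \mu(s_i) \mid \theta^{Q})$ is the usual way of applying the sampled policy gradient from Algorithm 1, since automatic differentiation propagates $\nabla_a Q$ through $\mu$ without forming the gradient product explicitly.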