| Algorithm 1 DDPG algorithm |
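| --- |
| Randomly initialize critic network $Q(s, a \mid \theta^Q)$ and actor $\mu(s \mid \theta^\mu)$ with weights $\theta^Q$ and $\theta^\mu$ |
| Initialize target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$ |
| Initialize replay buffer $R$ |
| **for** episode $= 1, M$ **do** |
| &emsp;Initialize a random process $\mathcal{N}$ for action exploration |
| &emsp;Receive initial observation state $s_1$ |
| &emsp;**for** $t = 1, T$ **do** |
| &emsp;&emsp;Select action $a_t = \mu(s_t \mid \theta^\mu) + \mathcal{N}_t$ according to the current policy and exploration noise |
| &emsp;&emsp;Execute action $a_t$ and observe reward $r_t$ and new state $s_{t+1}$ |
| &emsp;&emsp;Store transition $(s_t, a_t, r_t, s_{t+1})$ in $R$ |
| &emsp;&emsp;Sample a random minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$ |
| &emsp;&emsp;Set $y_i = r_i + \gamma\, Q'\big(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big)$ |
| &emsp;&emsp;Update critic by minimizing the loss: $L = \frac{1}{N} \sum_i \big(y_i - Q(s_i, a_i \mid \theta^Q)\big)^2$ |
| &emsp;&emsp;Update the actor policy using the sampled policy gradient: $\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a \mid \theta^Q)\big\rvert_{s=s_i,\, a=\mu(s_i)} \, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big\rvert_{s_i}$ |
| &emsp;&emsp;Update the target networks: $\theta^{Q'} \leftarrow \tau \theta^Q + (1 - \tau)\, \theta^{Q'}$, $\theta^{\mu'} \leftarrow \tau \theta^\mu + (1 - \tau)\, \theta^{\mu'}$ |
| &emsp;**end for** |
| **end for** |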
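
The steps of Algorithm 1 can be traced in a small, dependency-free sketch. Everything below is an illustrative assumption rather than the setup used in this work: the actor is a scalar linear policy, the critic is linear in hypothetical quadratic features, the environment is a toy one-dimensional system, plain Gaussian noise stands in for the exploration process, and the hyperparameter values are arbitrary. A practical implementation would use neural networks and an automatic differentiation library in place of the hand-written gradients.

```python
import random
from collections import deque

GAMMA, TAU = 0.99, 0.01            # assumed discount factor and soft-update rate
LR_ACTOR, LR_CRITIC = 1e-3, 1e-2   # assumed step sizes


def features(s, a):
    """Hypothetical quadratic features, so that Q(s, a) = theta . phi(s, a)."""
    return [s * s, s * a, a * a, 1.0]


def q_value(theta, s, a):
    return sum(w * f for w, f in zip(theta, features(s, a)))


def dq_da(theta, s, a):
    """Gradient of Q with respect to the action for the features above."""
    return theta[1] * s + 2.0 * theta[2] * a


def policy(w, s):
    """Deterministic linear actor mu(s) = w * s (toy stand-in for a network)."""
    return w * s


def soft_update(target, online, tau=TAU):
    """theta' <- tau * theta + (1 - tau) * theta', element-wise."""
    return [tau * o + (1.0 - tau) * t for o, t in zip(online, target)]


def ddpg_update(batch, theta, w, theta_targ, w_targ):
    """One critic step, one actor step, then the soft target update."""
    n = len(batch)
    # TD targets y_i are computed with the target actor and target critic.
    ys = [r + GAMMA * q_value(theta_targ, s2, policy(w_targ, s2))
          for (_, _, r, s2) in batch]
    # Critic: gradient descent on L = (1/N) * sum_i (y_i - Q(s_i, a_i))^2.
    grad = [0.0] * len(theta)
    for (s, a, _, _), y in zip(batch, ys):
        err = y - q_value(theta, s, a)
        for k, f in enumerate(features(s, a)):
            grad[k] += -2.0 * err * f / n
    new_theta = [t - LR_CRITIC * g for t, g in zip(theta, grad)]
    # Actor: gradient ascent on (1/N) * sum_i dQ/da(s_i, mu(s_i)) * dmu/dw,
    # where dmu/dw = s_i for the linear policy.
    actor_grad = sum(dq_da(new_theta, s, policy(w, s)) * s
                     for (s, _, _, _) in batch) / n
    new_w = w + LR_ACTOR * actor_grad
    # Target networks slowly track the online parameters.
    new_theta_targ = soft_update(theta_targ, new_theta)
    new_w_targ = soft_update([w_targ], [new_w])[0]
    return new_theta, new_w, new_theta_targ, new_w_targ


def train(episodes=5, steps=50, batch_size=16, seed=0):
    rng = random.Random(seed)
    theta = [rng.uniform(-0.1, 0.1) for _ in range(4)]
    w = rng.uniform(-0.1, 0.1)
    theta_targ, w_targ = list(theta), w        # targets start as copies
    buffer = deque(maxlen=10_000)              # replay buffer R
    for _ in range(episodes):
        s = rng.uniform(-1.0, 1.0)             # initial observation s_1
        for _ in range(steps):
            a = policy(w, s) + rng.gauss(0.0, 0.1)   # policy plus exploration noise
            a = max(-1.0, min(1.0, a))
            s2 = max(-2.0, min(2.0, s + a))    # toy dynamics (assumption)
            r = -(s2 * s2)                     # toy reward (assumption)
            buffer.append((s, a, r, s2))
            batch = rng.sample(list(buffer), min(batch_size, len(buffer)))
            theta, w, theta_targ, w_targ = ddpg_update(
                batch, theta, w, theta_targ, w_targ)
            s = s2
    return theta, w, theta_targ, w_targ, len(buffer)
```

Calling `train()` runs the nested loops of the algorithm and returns the online parameters, the target parameters, and the number of stored transitions. Keeping `ddpg_update` free of side effects makes each of the three update rules easy to check in isolation.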