Algorithm 1: Multi-pass DQN (MP-DQN) algorithm.
Input: probability distribution, mini-batch size, exploration parameter, learning rates.
Initialization: actor weights and actor-parameter weights.
For t = 1, 2, 3, ..., T do
    Estimate the action parameters with the actor network.
    Choose the action based on the greedy policy.
    Execute the action, receive the immediate reward, and observe the next state.
    Save the experience into the replay memory.
    Sample a mini-batch at random from the replay memory.
    Define the target y(t).
    Select the diagonal elements from the matrix of Q-values.
    Choose the best action by argmax over the diagonal elements.
    Estimate the gradients with respect to the actor weights and the actor-parameter weights.
    Update the weight parameters.
End for
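The multi-pass evaluation, diagonal selection, and action choice in Algorithm 1 can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in the algorithm box: each pass zeroes the parameters of all actions except one, the "greedy policy" is epsilon-greedy (suggested by the exploration parameter in the input), and `toy_q_network` is a hypothetical stand-in for a trained actor network. All function and variable names are illustrative.

```python
import random


def multi_pass_q(q_network, state, action_params):
    """Run one forward pass per discrete action and stack the results.

    Assumption: in pass k, the parameters of every action except k are
    replaced by zeros, so row k depends only on action k's parameters.
    """
    q_matrix = []
    for k in range(len(action_params)):
        masked = [
            params if j == k else [0.0] * len(params)
            for j, params in enumerate(action_params)
        ]
        q_matrix.append(q_network(state, masked))
    return q_matrix


def diagonal(q_matrix):
    """Keep Q(s, k, x_k): the value of action k under its own parameters."""
    return [q_matrix[k][k] for k in range(len(q_matrix))]


def select_action(q_diag, epsilon, rng):
    """Epsilon-greedy choice over the diagonal Q-values (assumed policy)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_diag))  # explore
    return max(range(len(q_diag)), key=q_diag.__getitem__)  # exploit (argmax)


def toy_q_network(state, params):
    """Hypothetical Q-function: Q_k = state + (k + 1) * sum of all inputs."""
    total = sum(sum(p) for p in params)
    return [state + (k + 1) * total for k in range(len(params))]


if __name__ == "__main__":
    rng = random.Random(0)
    action_params = [[0.5], [0.25, 0.25], [1.0]]  # one parameter vector per action
    q_diag = diagonal(multi_pass_q(toy_q_network, 1.0, action_params))
    print(q_diag, select_action(q_diag, epsilon=0.1, rng=rng))
```

The loop is written as separate calls for readability; in practice the passes could be stacked into a single batched forward pass through the network.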