| Algorithm 1: DQN-PTR to solve |
| --- |
| 01: Initialize the Q-Network Q with random weights |
| 02: Initialize the Target Q-Network with weights |
| 03: Initialize replay memory D to capacity N |
| 04: Initialize |
| 05: Initialize optimal trajectory to empty and maximum total return |
| 06: For do |
| 07:     Initialize sum reward |
| 08:     Initialize state , |
| 09:     For each step t do |
| 10:         If or then |
| 11:             If then |
| 12:                 Select a random action |
| 13:             else |
| 14:                 Set |
| 15:             end if |
| 16:             Execute action , observe next state and reward according to environment 1 |
| 17:         else |
| 18:             Set according |
| 19:             Execute action , observe next state and reward according to environment 2 |
| 20:         end if |
| 21: |
| 22:         Store transition in D |
| 23:         If episode terminates at step then |
| 24:             If then |
| 25:                 Replace the trajectory in with |
| 26: |
| 27:             else if and then |
| 28:                 Replace the trajectory in with |
| 29:             end if |
| 30:             break |
| 31:         end if |
| 32:         Sample random mini-batch of transitions from D |
| 33:         Set |
| 34:         Perform a gradient descent step on with respect to the network parameters |
| 35:         If then |
| 36: |
| 37:         end if |
| 38:         Every C steps reset |
| 39:     end for |
| 40: end for |
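
Several steps of Algorithm 1 are standard DQN building blocks: the replay memory D (steps 03, 22, 32), the random-or-greedy action choice (steps 11 to 15), and the learning target (step 33). A minimal Python sketch of these pieces follows, assuming the usual DQN definitions. All names (`ReplayMemory`, `epsilon_greedy`, `td_target`, `BestTrajectory`) are hypothetical and not taken from the source. The rule in `BestTrajectory` is an assumption: it replaces the stored trajectory only when an episode's total return strictly exceeds the best seen so far, and it does not model the second condition at step 27 or the switch between environment 1 and environment 2.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity buffer D; the oldest transitions are evicted first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement, capped at the current size.
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_target(reward, next_q_values, done, gamma=0.99):
    """Usual DQN target: r if terminal, else r + gamma * max over target-network Q."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


class BestTrajectory:
    """Keeps the highest-return episode seen so far (assumed replacement rule)."""

    def __init__(self):
        self.trajectory = []
        self.best_return = float("-inf")

    def update(self, trajectory, total_return):
        # Assumption: replace only on strict improvement of the total return.
        if total_return > self.best_return:
            self.trajectory = list(trajectory)
            self.best_return = total_return
            return True
        return False
```

In a full training loop, `next_q_values` would come from the target network, whose weights are copied from the online network every C steps (step 38).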