| Algorithm 1: MDRLAT |
| 1: Initialize the experience replay buffer with a fixed capacity, the parameters of the online network, the parameters of the target network, and the target-network update frequency |
| 2: for each training step do |
| 3: Obtain the current state from the environment; with the exploration probability, select a random action; otherwise, select the action the online network values most highly |
| 4: Execute the selected action, transition to the next state, and receive a reward |
| 5: Store the transition in the replay buffer |
| 6: Randomly sample a batch of transitions from the replay buffer |
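| 7: if the sampled transition ends the episode then |
| 8: Set the target value to the received reward |
| 9: else |
| 10: Set the target value to the received reward plus the discounted value that the target network estimates for the next state |
| 11: end if |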
| 12: Compute the Q-learning loss from the target value |
| 13: Feed the sampled input into the online network to obtain the estimated velocity |
| 14: Compute the auxiliary velocity-estimation loss |
| 15: Update the parameters of the online network |
| 16: if the step count is a multiple of the target-network update frequency then |
| 17: Update the parameters of the target network |
| 18: end if |
| 19: end for |
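
The building blocks of Algorithm 1 can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: every name here (`ReplayBuffer`, `epsilon_greedy`, `td_target`, `total_loss`, `should_sync_target`, `aux_weight`) is hypothetical. The max-over-actions bootstrap follows standard deep Q-learning, and the mean-squared-error losses combined as a weighted sum are assumptions, since the listing above does not state the exact loss forms.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transitions are dropped first."""

    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)

    def store(self, transition):
        self.items.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement, capped at the buffer size.
        return rng.sample(list(self.items), min(batch_size, len(self.items)))


def epsilon_greedy(q_values, epsilon, rng=random):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_target(reward, next_q_target, done, gamma):
    """Reward alone at episode end, otherwise a one-step bootstrapped target."""
    if done:
        return reward
    # Assumption: standard DQN bootstrap from the target network's Q-values.
    return reward + gamma * max(next_q_target)


def mse(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def total_loss(q_pred, q_target, vel_pred, vel_true, aux_weight=1.0):
    """Q-learning loss plus a weighted auxiliary velocity loss (assumed form)."""
    return mse(q_pred, q_target) + aux_weight * mse(vel_pred, vel_true)


def should_sync_target(step, update_freq):
    """True on steps where the target network copies the online parameters."""
    return step % update_freq == 0
```

In a full training loop these pieces would be called in the order of steps 3 to 17, with the gradient update of the online network supplied by whichever deep learning framework is in use.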