Table 2.
The hyperparameters and their values used to train the VTP.
| Hyperparameter | Value | Description |
|---|---|---|
| learning rate | 1×10⁻⁵ | The learning rate used by the VTP |
| minibatch size | 16 | The number of training samples used to update θᵢ in Equation (4) |
| target update frequency | 500 | The frequency with which the target parameters θ⁺ are updated |
| discount factor | 0.7 | Discount factor γ used by Q-learning |
| initial exploration | 0.999 | Initial value of ε in ε-greedy exploration |
| final exploration | 0.333 | Final value of ε in ε-greedy exploration |
| replay memory | 125,000 | The number of state–action pairs stored in the replay memory |
| number of episodes | 200 | Total number of training episodes |
| number of steps | 30 | Maximum number of time steps in each episode |
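The table above can be sketched as a training configuration. This is a minimal illustration only: the dictionary keys, the helper name `epsilon`, and the linear annealing schedule are assumptions, since the source gives only the two endpoint values of ε and does not show the VTP's actual training code.

```python
# Hyperparameters from Table 2, gathered into a config dict.
# Key names are illustrative, not taken from the VTP implementation.
CONFIG = {
    "learning_rate": 1e-5,
    "minibatch_size": 16,
    "target_update_frequency": 500,   # steps between updates of θ⁺
    "discount_factor": 0.7,           # γ in the Q-learning update
    "initial_exploration": 0.999,     # ε at the start of training
    "final_exploration": 0.333,       # ε after annealing
    "replay_memory": 125_000,         # stored state–action pairs
    "num_episodes": 200,
    "max_steps_per_episode": 30,
}

# Upper bound on total environment steps: 200 episodes × 30 steps.
TOTAL_STEPS = CONFIG["num_episodes"] * CONFIG["max_steps_per_episode"]

def epsilon(step: int, total_steps: int = TOTAL_STEPS) -> float:
    """Anneal ε linearly from its initial to its final value.

    The linear schedule is an assumption; the source specifies only
    the initial (0.999) and final (0.333) values of ε.
    """
    frac = min(step / total_steps, 1.0)
    e0 = CONFIG["initial_exploration"]
    e1 = CONFIG["final_exploration"]
    return e0 + frac * (e1 - e0)
```

Under this sketch, `epsilon(0)` returns the initial value 0.999, and the value decays to 0.333 once `step` reaches the total step budget, after which it stays fixed.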