TABLE 3.
Non-default hyperparameters for each RL algorithm.
| Hyperparameter | Description | Value |
|---|---|---|
| Proximal Policy Optimization (PPO) | | |
| n_steps | Number of steps to run per environment per update | 1024 |
| nminibatches | Number of training minibatches per update | 32 |
| lam | Bias vs variance trade-off factor for GAE (λ) | 0.98 |
| gamma | Discount factor (γ) | 0.999 |
| learning_rate | Learning rate | 2e-4 |
| Deep Deterministic Policy Gradient (DDPG) | | |
| memory_limit | Size of replay buffer | 1,000,000 |
| normalize_obs | Whether agent observations are normalized | True |
| gamma | Discount factor (γ) | 0.98 |
| actor_lr | Learning rate for actor network | 0.00156 |
| critic_lr | Learning rate for critic network | 0.00156 |
| batch_size | Size of the batch for learning the policy | 256 |
| action_noise | Action noise type and magnitude | Ornstein-Uhlenbeck (μ = [0, 0], σ = [0.5, 0.5]) |
| Twin Delayed DDPG (TD3) | | |
| buffer_size | Size of replay buffer | 1,000,000 |
| train_freq | Update the model every n steps | 1000 |
| gradient_steps | Gradient updates after each step | 1000 |
| learning_starts | Steps before learning starts | 10000 |
| action_noise | Action noise type and magnitude | |
| Soft Actor-Critic (SAC) | | |
| None | All hyperparameters are left at their default values | N/A |
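The parameter names in Table 3 appear to correspond to the Stable Baselines (v2) implementations of these algorithms. As a rough illustration, the sketch below shows how the listed values could be passed to the corresponding constructors. The environment (`Pendulum-v0`) and the `MlpPolicy` choice are placeholders, not taken from the paper, and keyword spellings are adapted where the library differs slightly (e.g. `buffer_size` and `normalize_observations` for the table's `memory_limit` and `normalize_obs`).

```python
import gym
import numpy as np
from stable_baselines import PPO2, DDPG, TD3, SAC
from stable_baselines.ddpg.noise import OrnsteinUhlenbeckActionNoise

# Placeholder continuous-control task; the paper's own environment is not
# specified in this table.
env = gym.make("Pendulum-v0")
n_actions = env.action_space.shape[-1]

# PPO (PPO2 in Stable Baselines) with the non-default values from Table 3.
ppo = PPO2("MlpPolicy", env, n_steps=1024, nminibatches=32,
           lam=0.98, gamma=0.999, learning_rate=2e-4)

# DDPG with Ornstein-Uhlenbeck action noise (mu = 0, sigma = 0.5 per action
# dimension), a 1M-transition replay buffer, and observation normalization.
ou_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(n_actions),
                                        sigma=0.5 * np.ones(n_actions))
ddpg = DDPG("MlpPolicy", env, buffer_size=1000000, normalize_observations=True,
            gamma=0.98, actor_lr=0.00156, critic_lr=0.00156,
            batch_size=256, action_noise=ou_noise)

# TD3 with a large replay buffer and batched gradient updates every 1000 steps.
td3 = TD3("MlpPolicy", env, buffer_size=1000000, train_freq=1000,
          gradient_steps=1000, learning_starts=10000)

# SAC is left entirely at its default hyperparameters.
sac = SAC("MlpPolicy", env)
```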