Proximal Policy Optimization (PPO)

| Hyperparameter | Description | Value |
| --- | --- | --- |
| n_steps | Number of steps to run for each env per update | 1024 |
| nminibatches | Number of training minibatches per update | 32 |
| lam | Bias vs. variance trade-off factor for GAE (λ) | 0.98 |
| gamma | Discount factor (γ) | 0.999 |
| learning_rate | Learning rate | 2e-4 |
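
The parameter names above match the Stable Baselines (v2) API, where this PPO implementation is exposed as `PPO2`. Assuming that library, the configuration could be written as the sketch below; the policy class, environment ID, and training length are illustrative assumptions, not values from the table.

```python
from stable_baselines import PPO2

# Sketch only: "MlpPolicy" and the environment ID are assumptions,
# not specified by the table above.
model = PPO2(
    "MlpPolicy",
    "CartPole-v1",
    n_steps=1024,        # rollout steps per environment per update
    nminibatches=32,     # training minibatches per update
    lam=0.98,            # GAE lambda (bias/variance trade-off)
    gamma=0.999,         # discount factor
    learning_rate=2e-4,
)
model.learn(total_timesteps=100000)  # training length is a placeholder
```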

Deep Deterministic Policy Gradient (DDPG)

| Hyperparameter | Description | Value |
| --- | --- | --- |
| memory_limit | Size of replay buffer | 1,000,000 |
| normalize_obs | Whether agent observations are normalized | True |
| gamma | Discount factor (γ) | 0.98 |
| actor_lr | Learning rate for actor network | 0.00156 |
| critic_lr | Learning rate for critic network | 0.00156 |
| batch_size | Size of the batch for learning the policy | 256 |
| action_noise | Action noise type and magnitude | OrnsteinUhlenbeck (μ = [0, 0], σ = [0.5, 0.5]) |
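
Under the same Stable Baselines (v2) assumption, the Ornstein-Uhlenbeck process lives in `stable_baselines.ddpg.noise`, and the table's `normalize_obs` corresponds to the library's `normalize_observations` flag. The two-dimensional μ and σ imply a 2-D action space, so the environment below is a placeholder chosen only for that shape:

```python
import numpy as np
from stable_baselines import DDPG
from stable_baselines.ddpg.noise import OrnsteinUhlenbeckActionNoise

# mu and sigma in the table are two-dimensional, so a 2-D continuous
# action space is assumed; the environment ID is only a placeholder.
action_noise = OrnsteinUhlenbeckActionNoise(
    mean=np.array([0.0, 0.0]),
    sigma=np.array([0.5, 0.5]),
)

model = DDPG(
    "MlpPolicy",
    "LunarLanderContinuous-v2",
    memory_limit=1000000,         # replay buffer size ('buffer_size' in newer releases)
    normalize_observations=True,  # the table's 'normalize_obs'
    gamma=0.98,
    actor_lr=0.00156,
    critic_lr=0.00156,
    batch_size=256,
    action_noise=action_noise,
)
```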

Twin Delayed DDPG (TD3)

| Hyperparameter | Description | Value |
| --- | --- | --- |
| buffer_size | Size of replay buffer | 1,000,000 |
| train_freq | Update the model every n steps | 1000 |
| gradient_steps | Number of gradient updates after each rollout | 1000 |
| learning_starts | Steps before learning starts | 10000 |
| action_noise | Action noise type and magnitude | |
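
A matching TD3 sketch under the same Stable Baselines (v2) assumptions follows; the table gives no value for `action_noise`, so it is simply left at the library default here, and the policy and environment remain placeholders:

```python
from stable_baselines import TD3

# Placeholder policy/env, as above; action_noise is left at its
# default because the table does not give a value for it.
model = TD3(
    "MlpPolicy",
    "LunarLanderContinuous-v2",
    buffer_size=1000000,    # replay buffer size
    train_freq=1000,        # collect 1000 env steps between updates
    gradient_steps=1000,    # then run 1000 gradient updates
    learning_starts=10000,  # steps collected before learning begins
)
```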

Soft Actor Critic (SAC)

All hyperparameters are left at their default values.
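
With every hyperparameter at its default, an equivalent SAC run (again assuming Stable Baselines v2, with placeholder policy, environment, and training length) reduces to a bare constructor call:

```python
from stable_baselines import SAC

# All hyperparameters at the library defaults, per the table;
# the policy and environment ID are placeholders.
model = SAC("MlpPolicy", "LunarLanderContinuous-v2")
model.learn(total_timesteps=100000)
```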