Front Robot AI. 2021 Sep 13;8:738113. doi: 10.3389/frobt.2021.738113

TABLE 3.

Non-default hyperparameters for each RL algorithm.

| Hyperparameter | Description | Value |
|---|---|---|
| Proximal Policy Optimization (PPO) | | |
| n_steps | Number of steps to run for each environment per update | 1024 |
| nminibatches | Number of training minibatches per update | 32 |
| lam | Bias vs. variance trade-off factor for GAE (λ) | 0.98 |
| gamma | Discount factor (γ) | 0.999 |
| learning_rate | Learning rate | 2e-4 |
| Deep Deterministic Policy Gradient (DDPG) | | |
| memory_limit | Size of the replay buffer | 1,000,000 |
| normalize_obs | Whether agent observations are normalized | True |
| gamma | Discount factor (γ) | 0.98 |
| actor_lr | Learning rate for the actor network | 0.00156 |
| critic_lr | Learning rate for the critic network | 0.00156 |
| batch_size | Batch size for learning the policy | 256 |
| action_noise | Action noise type and magnitude | OrnsteinUhlenbeck (μ = [0, 0], σ = [0.5, 0.5]) |
| Twin Delayed DDPG (TD3) | | |
| buffer_size | Size of the replay buffer | 1,000,000 |
| train_freq | Update the model every n steps | 1,000 |
| gradient_steps | Number of gradient updates after each rollout | 1,000 |
| learning_starts | Number of steps before learning starts | 10,000 |
| action_noise | Action noise type and magnitude | N(μ = 0, σ = 0.1) |
| Soft Actor-Critic (SAC) | | |
| None | All hyperparameters are default | N/A |
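For reference, below is a minimal sketch of how the overrides in Table 3 could be passed to the RL implementations whose parameter names they match (the Stable Baselines v2 constructors). The library choice, the placeholder environment id "YourRobotEnv-v0", and the mapping of the table's normalize_obs to the keyword normalize_observations are assumptions for illustration, not details taken from the paper.

```python
# Sketch under the assumption that the Table 3 parameter names refer to the
# Stable Baselines (v2) implementations of PPO2, DDPG, TD3, and SAC.
import gym
import numpy as np

from stable_baselines import DDPG, PPO2, SAC, TD3
from stable_baselines.common.noise import (
    NormalActionNoise,
    OrnsteinUhlenbeckActionNoise,
)

env = gym.make("YourRobotEnv-v0")       # placeholder environment id (assumption)
n_actions = env.action_space.shape[0]   # Table 3's noise vectors imply 2 action dimensions

# PPO: non-default rollout length, minibatch count, GAE lambda, gamma, learning rate.
ppo = PPO2(
    "MlpPolicy", env,
    n_steps=1024, nminibatches=32, lam=0.98, gamma=0.999, learning_rate=2e-4,
)

# DDPG: large replay buffer, normalized observations, Ornstein-Uhlenbeck exploration noise.
ddpg = DDPG(
    "MlpPolicy", env,
    memory_limit=1_000_000,              # replay buffer size
    normalize_observations=True,         # "normalize_obs" in Table 3 (assumed keyword)
    gamma=0.98,
    actor_lr=0.00156, critic_lr=0.00156,
    batch_size=256,
    action_noise=OrnsteinUhlenbeckActionNoise(
        mean=np.zeros(n_actions), sigma=0.5 * np.ones(n_actions)
    ),
)

# TD3: infrequent but large blocks of gradient updates, Gaussian noise N(0, 0.1).
td3 = TD3(
    "MlpPolicy", env,
    buffer_size=1_000_000, train_freq=1000, gradient_steps=1000,
    learning_starts=10_000,
    action_noise=NormalActionNoise(
        mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions)
    ),
)

# SAC: all hyperparameters left at the library defaults.
sac = SAC("MlpPolicy", env)
```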