Algorithm 1 Proximal Policy Optimization (PPO)

Require: initial policy parameters $\theta$ and value function parameters $\phi$
while not converged do
    for each training iteration do
        for each environment step $t$ do
            observe the current state $s_t$
            sample an action $a_t \sim \pi_\theta(\cdot \mid s_t)$
            execute $a_t$; receive the reward $r_t$ and the next state $s_{t+1}$
            store the transition $(s_t, a_t, r_t, s_{t+1})$ in the buffer
        end for
        compute the advantage estimates $\hat{A}_t$ using Generalized Advantage Estimation (GAE)
        for each policy update step do
            update the policy by gradient ascent on the clipped surrogate objective $L^{\mathrm{CLIP}}(\theta)$
            update the value function by minimizing the squared error $\big(V_\phi(s_t) - \hat{R}_t\big)^2$
        end for
    end for
end while
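To make the two update steps concrete, the following is a minimal Python sketch (assuming PyTorch) of the GAE computation and the PPO losses. It is not the implementation described here: the function names `compute_gae` and `ppo_losses`, and the hyperparameter values ($\gamma = 0.99$, $\lambda = 0.95$, clip range $\epsilon = 0.2$) are illustrative assumptions.

```python
# Sketch of the PPO inner loop pieces; assumes PyTorch.
# All hyperparameters below are common defaults, not values from this paper.
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one stored trajectory.

    rewards, dones: tensors of shape [T]; values: shape [T + 1],
    including a bootstrap value for the state after the last step.
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        # Exponentially weighted sum of future residuals
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    returns = advantages + values[:-1]  # regression targets for V_phi
    return advantages, returns

def ppo_losses(new_log_probs, old_log_probs, advantages,
               value_preds, returns, clip_eps=0.2):
    """Clipped surrogate policy loss and squared-error value loss."""
    ratio = torch.exp(new_log_probs - old_log_probs)  # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Gradient ascent on L^CLIP == gradient descent on its negative
    policy_loss = -torch.min(unclipped, clipped).mean()
    value_loss = (value_preds - returns).pow(2).mean()
    return policy_loss, value_loss
```

In a training loop, one would typically minimize `policy_loss + c * value_loss` (with a small coefficient such as $c = 0.5$) using a gradient-based optimizer over several epochs of minibatches drawn from the buffer, recomputing `new_log_probs` under the current policy at each step.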