Biomimetics. 2024 Jun 8;9(6):346. doi: 10.3390/biomimetics9060346
Algorithm 2 DDPG with HER and Adaptive Exploration
1:  Initialize online policy network $\mu_\theta$ with weights $\theta$
2:  Initialize target policy network $\mu_{\theta'}$ with weights $\theta' \leftarrow \theta$
3:  Initialize online Q network $Q_\phi$ with weights $\phi$
4:  Initialize target Q network $Q_{\phi'}$ with weights $\phi' \leftarrow \phi$
5:  Initialize experience replay pool $D$ to capacity $N$
6:  for episode = 1, M do
7:      Receive initial observation state $s_1$
8:      for t = 1, T do
9:          Obtain $\mu$ from the adaptive exploration adjustment unit
10:         Calculate the exploration noise
11:         Select action $a_t = \mu_\theta(s_t) + \text{exploration noise}$
12:         Execute action $a_t$ in the environment
13:         Observe reward $r_t$ and new state $s_{t+1}$
14:         Store transition $(s_t, a_t, r_t, s_{t+1})$ in $D$
15:         Store modified transitions with alternative goals
16:         Sample a random mini-batch of $K$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $D$
17:         Set $y_i = r_i + \gamma\, Q_{\phi'}\big(s_{i+1}, \mu_{\theta'}(s_{i+1})\big)$
18:         Update $\phi$ by minimizing the loss: $L = \frac{1}{K}\sum_i \big(y_i - Q_\phi(s_i, a_i)\big)^2$
19:         Update $\theta$ using the sampled policy gradient:
            $\nabla_\theta J \approx \frac{1}{K}\sum_i \nabla_a Q_\phi(s, a)\big|_{s = s_i,\, a = \mu_\theta(s_i)}\, \nabla_\theta \mu_\theta(s)\big|_{s_i}$
20:         Update the target networks:
            $\theta' \leftarrow \tau\theta + (1 - \tau)\theta'$
            $\phi' \leftarrow \tau\phi + (1 - \tau)\phi'$
21:     end for
22: end for
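
Steps 9–11 and 15 are the two additions to vanilla DDPG: exploration noise whose magnitude comes from the adaptive exploration adjustment unit, and hindsight relabeling of stored transitions with alternative goals. The listing leaves both as one-line steps, so the Python sketch below only illustrates one common way to realize them; the "future" relabeling strategy, the Gaussian noise model, and the helper names (adaptive_exploration_noise, relabel_with_her, compute_reward) are assumptions for illustration, not the authors' implementation.

import numpy as np

rng = np.random.default_rng()

def adaptive_exploration_noise(action_dim, noise_scale):
    # Step 10 (assumed Gaussian model): the scale is whatever the adaptive
    # exploration adjustment unit returned in step 9.
    return rng.normal(0.0, noise_scale, size=action_dim)

def relabel_with_her(episode, compute_reward, k_future=4):
    # Step 15 (assumed "future" HER strategy): for each transition, reuse
    # states achieved later in the same episode as substitute goals and
    # recompute the reward against them, yielding extra transitions for D.
    relabeled = []
    for t, (s, a, r, s_next, goal) in enumerate(episode):
        future_steps = rng.integers(t, len(episode), size=k_future)
        for idx in future_steps:
            new_goal = episode[idx][3]              # achieved state used as goal
            new_reward = compute_reward(s_next, new_goal)
            relabeled.append((s, a, new_reward, s_next, new_goal))
    return relabeled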
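
Steps 16–20 are the standard DDPG update. The PyTorch sketch below shows one such update under stated assumptions: the actor, critic, and their targets are torch.nn.Module instances, the critic takes (state, action) pairs, and the mini-batch arrives as tensors; the hyperparameter values for gamma ($\gamma$) and tau ($\tau$) are placeholders, and terminal-state masking is omitted for brevity.

import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, actor_target, critic, critic_target,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    s, a, r, s_next = batch   # each a tensor with leading batch dimension K

    # Step 17: TD target y_i = r_i + gamma * Q_phi'(s_{i+1}, mu_theta'(s_{i+1}))
    with torch.no_grad():
        y = r + gamma * critic_target(s_next, actor_target(s_next))

    # Step 18: update phi by minimizing L = (1/K) * sum_i (y_i - Q_phi(s_i, a_i))^2
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Step 19: deterministic policy gradient -- maximizing Q_phi(s, mu_theta(s))
    # over the mini-batch follows the sampled gradient in the listing.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Step 20: Polyak (soft) update of the target networks with rate tau.
    with torch.no_grad():
        for p, p_targ in zip(actor.parameters(), actor_target.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)
        for p, p_targ in zip(critic.parameters(), critic_target.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)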