Algorithm 2 Hybrid Reinforcement Learning Algorithm
1: Initialize: state space S and action space A
2: Apply discrete actions randomly to the robot and collect the data set D_1
3: Use D_1 to generate D_2, then obtain the transition model
4: Obtain the reduced action space a_i^reduced
5: Initialize A using a_i^rough ∈ [a_i^rough(min), a_i^rough(max)]
6: Initialize replay memory D to capacity N and the parameter vector θ
7: For episode=1,K do
8:   Reset the robot and the platform to their initial positions, and empty the temporary transition table: T = ∅
9:   For t=1, T do
10:     Obtain the current state S_t from the sensors' readings
11:     Select a random action from A with probability ϵ; otherwise select a_t^k = argmax_{a∈A} Q(S_t, a|θ). Observe the next state S_{t+1} and receive an immediate reward r_{t+1}
12:     Append the transition (S_t, a_t, r_t, R_t^λ, S_{t+1}) to T
13:     If S_{t+1} is within the stable region S_s
14:       Update R_T^λ using Algorithm 1
15:       Store T in D and refresh it: T = ∅
16:     End If
17:     Sample a random minibatch of transitions (S_j, a_j, r_j, R_j^λ, S_{j+1}), j = 1, 2, …, P, from D
18:     Apply a gradient descent step on θ to improve the Q-function: θ ← θ − α ∇_θ (R_j^λ − Q(S_j, a_j|θ))^2
19:     Every C steps, update R_D^λ using Algorithm 1
20:    End For
21: End For
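To make the control flow of Algorithm 2 concrete, the following is a minimal Python sketch of the training loop. It assumes a discretized (reduced) action set, an ϵ-greedy policy, a linear Q-function parameterized by θ, and a simple backward λ-return recursion standing in for Algorithm 1; the toy environment, the dimensions, and helper names such as env_step and lambda_returns are illustrative assumptions, not part of the original paper.

```python
# Sketch of Algorithm 2 with a linear Q-function and a toy environment.
import random
from collections import deque

import numpy as np

STATE_DIM, N_ACTIONS = 4, 9                 # reduced/discretized action set (assumed sizes)
GAMMA, LAM, ALPHA, EPS = 0.99, 0.9, 1e-3, 0.1
MEMORY_CAP, BATCH, EPISODES, T_MAX = 10_000, 32, 200, 100

rng = np.random.default_rng(0)
theta = rng.normal(scale=0.1, size=(STATE_DIM, N_ACTIONS))  # linear Q(s, .) = s^T theta
memory = deque(maxlen=MEMORY_CAP)                            # replay memory D

def q_values(state, params):
    return state @ params                                    # vector of Q(s, a) for all actions a

def lambda_returns(rewards, boot_values):
    """Stand-in for Algorithm 1: backward recursion of the lambda-return,
    G_t = r_t + gamma * ((1 - LAM) * V(s_{t+1}) + LAM * G_{t+1})."""
    g = boot_values[-1]
    out = []
    for r, v in zip(reversed(rewards), reversed(boot_values)):
        g = r + GAMMA * ((1 - LAM) * v + LAM * g)
        out.append(g)
    return out[::-1]

def env_reset():
    return rng.normal(size=STATE_DIM)

def env_step(state, action):
    """Toy stand-in for the robot/platform dynamics: contracting drift plus noise."""
    push = 0.05 * (action - N_ACTIONS // 2) * np.ones(STATE_DIM) / STATE_DIM
    next_state = 0.9 * state + push + 0.01 * rng.normal(size=STATE_DIM)
    reward = -float(np.linalg.norm(next_state))   # closer to the origin is better
    stable = np.linalg.norm(next_state) < 0.5     # proxy for the stable region S_s
    return next_state, reward, stable

for episode in range(EPISODES):
    state = env_reset()
    segment = []                                  # temporary transition table T
    for t in range(T_MAX):
        # epsilon-greedy choice over the reduced action set
        if rng.random() < EPS:
            action = int(rng.integers(N_ACTIONS))
        else:
            action = int(np.argmax(q_values(state, theta)))
        next_state, reward, stable = env_step(state, action)
        segment.append((state, action, reward, next_state))

        if stable:                                # S_{t+1} entered the stable region
            rewards = [tr[2] for tr in segment]
            boots = [float(np.max(q_values(tr[3], theta))) for tr in segment]
            for (s, a, r, s2), g in zip(segment, lambda_returns(rewards, boots)):
                memory.append((s, a, r, g, s2))   # store (S, a, r, R^lambda, S') in D
            segment = []                          # refresh T

        # minibatch gradient descent step on (R^lambda - Q)^2
        if len(memory) >= BATCH:
            for s, a, r, g, s2 in random.sample(list(memory), BATCH):
                td = g - q_values(s, theta)[a]
                theta[:, a] += ALPHA * td * s     # descent direction for the linear Q
        state = next_state
```

In the paper the Q-function is a neural network and the λ-return bookkeeping follows Algorithm 1; the linear approximator and toy dynamics above only illustrate the flow of lines 7–20 (ϵ-greedy selection, the temporary table T emptied when the stable region is reached, and minibatch updates of θ from the replay memory D).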