| Algorithm 1: MEP–DDPG Algorithm |
| --- |
| Randomly initialize the critic network $Q(s, a \mid \theta^{Q})$ and the actor $\mu(s \mid \theta^{\mu})$ with weights $\theta^{Q}$ and $\theta^{\mu}$ |
| Initialize the target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^{Q}$, $\theta^{\mu'} \leftarrow \theta^{\mu}$ |
| Initialize the experience pools |
| for episode $= 1, M$ do |
| &emsp;Initialize a random process $\mathcal{N}$ for action exploration |
| &emsp;Receive the initial observation state $s_1$ |
| &emsp;for $t = 1, T$ do |
| &emsp;&emsp;According to a preset probability, select the action $a_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t$ from the current policy and exploration noise, or select the action $a_t$ by MPC-SA |
| &emsp;&emsp;Execute the action $a_t$, then observe the reward $r_t$ and the new state $s_{t+1}$ |
| &emsp;&emsp;Store the experience $(s_t, a_t, r_t, s_{t+1})$ in the specific pool corresponding to the source of $a_t$ |
| &emsp;&emsp;Following the sampling policy, sample a random minibatch of $N$ experiences $(s_i, a_i, r_i, s_{i+1})$ from the experience pools |
| &emsp;&emsp;Set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$ |
| &emsp;&emsp;Update the critic by minimizing the loss $L = \frac{1}{N} \sum_i \left( y_i - Q(s_i, a_i \mid \theta^{Q}) \right)^2$ |
| &emsp;&emsp;Update the actor policy using the sampled policy gradient $\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_i \nabla_{a} Q(s, a \mid \theta^{Q}) \big\vert_{s = s_i,\, a = \mu(s_i)} \, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \big\vert_{s_i}$ |
| &emsp;&emsp;Update the target networks: $\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau) \theta^{Q'}$ and $\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau) \theta^{\mu'}$ |
| &emsp;end for |
| end for |
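The two steps that distinguish MEP–DDPG from standard DDPG are the probabilistic choice between the actor and the MPC-SA planner, and the routing of each experience to a pool that matches its source. The Python sketch below illustrates this bookkeeping under stated assumptions: the two-pool layout, the `expert_fraction` sampling ratio, and the `mpc_sa_action` planner hook are illustrative placeholders rather than the exact implementation used in the paper.

```python
import random
from collections import deque


def select_action(actor, state, noise, mpc_sa_action, policy_prob=0.8):
    """With probability policy_prob use the actor plus exploration noise,
    otherwise fall back to the MPC-SA planner (assumed interface)."""
    if random.random() < policy_prob:
        return actor(state) + noise.sample(), "policy"
    return mpc_sa_action(state), "mpc_sa"


class MixedExperiencePool:
    """Two replay pools, one per action source, with a ratio-based sampling policy (illustrative)."""

    def __init__(self, capacity=100_000):
        self.policy_pool = deque(maxlen=capacity)   # experiences generated by the actor network
        self.expert_pool = deque(maxlen=capacity)   # experiences generated by the MPC-SA planner

    def store(self, transition, source):
        # Route the transition (s, a, r, s_next) to the pool matching the source of the action.
        pool = self.expert_pool if source == "mpc_sa" else self.policy_pool
        pool.append(transition)

    def sample(self, batch_size, expert_fraction=0.25):
        # Sampling policy (assumed): draw a fixed fraction of the minibatch from the expert pool,
        # and fill the remainder from the policy pool.
        n_expert = min(int(batch_size * expert_fraction), len(self.expert_pool))
        n_policy = min(batch_size - n_expert, len(self.policy_pool))
        return random.sample(self.expert_pool, n_expert) + random.sample(self.policy_pool, n_policy)
```

Keeping the pools separate makes the sampling policy easy to adjust later, for example by annealing `expert_fraction` toward zero as the learned policy improves.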
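The inner-loop updates (target value $y_i$, critic loss, sampled policy gradient, and soft target update) are the standard DDPG equations. The following is a minimal PyTorch sketch of one such update, assuming `actor`, `critic`, and their target copies are `torch.nn.Module`s operating on batched tensors; the hyperparameters $\gamma$ and $\tau$ are placeholders.

```python
import torch
import torch.nn.functional as F


def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One critic/actor update plus soft target update, as in the inner loop of Algorithm 1."""
    s, a, r, s_next = batch  # tensors: states, actions, rewards, next states

    # Target value: y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))

    # Critic update: minimize L = (1/N) * sum_i (y_i - Q(s_i, a_i))^2
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: follow the sampled policy gradient by minimizing -Q(s, mu(s))
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft target updates: theta' <- tau * theta + (1 - tau) * theta'
    for net, target in ((critic, target_critic), (actor, target_actor)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```

Minimizing $-\frac{1}{N} \sum_i Q(s_i, \mu(s_i) \mid \theta^{Q})$ is the usual way of applying the sampled policy gradient from Algorithm 1, since automatic differentiation propagates $\nabla_a Q$ through $\mu$ without forming the gradient product explicitly.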