End-to-End Automated Lane-Change Maneuvering Considering Driving Style Using a Deep Deterministic Policy Gradient Algorithm

2020 Sep 22;20(18):5443. doi: 10.3390/s20185443

Algorithm 1 DDPG algorithm

Randomly initialize critic network Q(s, a | θ^{Q}) and actor μ(s | θ^{μ}) with weights θ^{Q} and θ^{μ}

Initialize target networks Q^{'} and μ^{'} with weights θ^{Q^{'}} \leftarrow θ^{Q}, θ^{μ^{'}} \leftarrow θ^{μ}

Initialize replay buffer R
for episode = 1, M do
Initialize a random process N for action exploration
Receive initial observation state s_{1}

for t = 1, T do
Select action a_{t} = μ(s_{t} | θ^{μ}) + N_{t} according to the current policy and exploration noise
Execute action a_{t}, observe reward r_{t}, and observe new state s_{t + 1}

Store transition < s_{t}, a_{t}, r_{t}, s_{t + 1} > in R
Sample a random minibatch of N transitions < s_{i}, a_{i}, r_{i}, s_{i + 1} > from R
Set y_{i} = r_{i} + γ Q^{'}(s_{i + 1}, μ^{'}(s_{i + 1} | θ^{μ^{'}}) | θ^{Q^{'}})

Update critic by minimizing the loss:

L = \frac{1}{N} \sum_{i} (y_{i} - Q(s_{i}, a_{i} | θ^{Q}))^{2}

Update the actor policy using the sampled policy gradient of the expected return J:

\nabla_{θ^{μ}} J \approx \frac{1}{N} \sum_{i} \nabla_{a} Q(s, a | θ^{Q}) |_{s = s_{i}, a = μ(s_{i})} \nabla_{θ^{μ}} μ(s | θ^{μ}) |_{s = s_{i}}

Update the target networks:

θ^{Q^{'}} \leftarrow τ θ^{Q} + (1 - τ) θ^{Q^{'}}

θ^{μ^{'}} \leftarrow τ θ^{μ} + (1 - τ) θ^{μ^{'}}

end for
end for
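
The arithmetic of the individual steps in Algorithm 1 can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this work: the names `ReplayBuffer`, `td_targets`, `critic_loss`, `soft_update`, and `linear_actor_gradient` are hypothetical, network weights are represented as flat lists of floats, and the actor-gradient helper assumes a toy linear actor and a toy linear critic so that both gradients in the sampled policy gradient have closed forms. A practical implementation would instead use neural networks and automatic differentiation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer R of transitions (s_t, a_t, r_t, s_{t+1})."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once capacity is reached.
        self.data = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.data.append((s, a, r, s_next))

    def sample(self, n, rng=random):
        # Uniform sampling, without replacement, of a minibatch of n transitions.
        return rng.sample(list(self.data), n)


def td_targets(rewards, next_q_values, gamma):
    """y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})).

    next_q_values holds the target-critic outputs, assumed precomputed.
    """
    return [r + gamma * q for r, q in zip(rewards, next_q_values)]


def critic_loss(targets, q_values):
    """L = (1/N) * sum_i (y_i - Q(s_i, a_i))^2."""
    n = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / n


def soft_update(target_weights, online_weights, tau):
    """theta' <- tau * theta + (1 - tau) * theta', element by element."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_weights, target_weights)]


def linear_actor_gradient(states, w_a):
    """Sampled policy gradient for a toy scalar-action model (assumption).

    Actor:  mu(s | theta) = theta . s
    Critic: Q(s, a) = w_s . s + w_a * a
    Then grad_a Q = w_a and grad_theta mu = s, so the estimate is
    (1/N) * sum_i w_a * s_i.
    """
    n = len(states)
    dim = len(states[0])
    return [sum(w_a * s[j] for s in states) / n for j in range(dim)]
```

As a usage example under these assumptions, `td_targets([1.0, 0.0], [2.0, 4.0], gamma=0.5)` gives `[2.0, 2.0]`, and `critic_loss` of those targets against critic outputs `[1.0, 3.0]` gives `1.0`. With a small τ, `soft_update` moves the target weights only a fraction of the way toward the online weights at each step, which is the purpose of the target networks in the algorithm.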