Algorithm 1: DDPG-based Learning Algorithm

1:  Initialize weights $\theta^{Q}$ and $\theta^{\mu}$ of critic network $Q(s, a \mid \theta^{Q})$ and actor network $\mu(s \mid \theta^{\mu})$
2:  Initialize weights $\theta^{Q'} \leftarrow \theta^{Q}$, $\theta^{\mu'} \leftarrow \theta^{\mu}$ of target networks $Q'$ and $\mu'$
3:  Initialize experience replay buffer $\mathcal{B}$
4:  for episode $= 1, 2, \ldots, M$ do
5:      Receive initial observation state $s_{1}$
6:      for $t = 1, 2, \ldots, T$ do
7:          Choose action $a_{t}$ and do the simulation using pandapower
8:          Observe reward $r_{t}$ and the next state $s_{t+1}$
9:          Store transition $(s_{t}, a_{t}, r_{t}, s_{t+1})$ in $\mathcal{B}$
10:         Sample a random minibatch of $N$ transitions $(s_{i}, a_{i}, r_{i}, s_{i+1})$ from $\mathcal{B}$
11:         Set $y_{i}$ according to Equation (26)
12:         Update the critic network parameters $\theta^{Q}$ by minimizing the loss, see Equation (27)
13:         Update the actor policy using the sampled policy gradient, see Equation (28)
14:         Softly update the target networks using the updated critic and actor network parameters:
            $\theta^{Q'} \leftarrow \tau\theta^{Q} + (1-\tau)\,\theta^{Q'}$, $\quad \theta^{\mu'} \leftarrow \tau\theta^{\mu} + (1-\tau)\,\theta^{\mu'}$
15:     end for
16: end for
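Since Equations (26)–(28) and the environment details are not reproduced inside the algorithm box, the following is a minimal PyTorch sketch of the loop structure only. Everything environment-specific is an assumption made for illustration: `PandapowerEnv`, the `case9` test grid, the ±10% generator set-point action, and the flat-voltage reward are hypothetical stand-ins, not the paper's definitions, and the standard DDPG target, critic loss, and policy gradient stand in for Equations (26)–(28).

```python
import copy
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn


class Actor(nn.Module):
    """mu(s | theta_mu): maps a state to a bounded action (line 1)."""
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(),
                                 nn.Linear(64, a_dim), nn.Tanh())

    def forward(self, s):
        return self.net(s)


class Critic(nn.Module):
    """Q(s, a | theta_Q): scores a state-action pair (line 1)."""
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))


class PandapowerEnv:
    """Hypothetical wrapper: the paper's state, action, and reward
    definitions are not given in the box, so this shows only the
    plumbing (apply an action, re-run the power flow, read the result)."""
    def __init__(self):
        import pandapower as pp
        import pandapower.networks as pn
        self.pp, self.net = pp, pn.case9()   # stand-in 9-bus test grid
        self.base_p = self.net.gen.p_mw.to_numpy().copy()

    def reset(self):
        self.net.gen.p_mw = self.base_p
        self.pp.runpp(self.net)
        return self.net.res_bus.vm_pu.to_numpy(dtype=np.float32)

    def step(self, a):
        # Hypothetical action: shift generator set-points by up to +/-10%.
        self.net.gen.p_mw = self.base_p * (1.0 + 0.1 * a[:len(self.base_p)])
        self.pp.runpp(self.net)              # the "simulation" of line 7
        v = self.net.res_bus.vm_pu.to_numpy(dtype=np.float32)
        return v, -float(np.abs(v - 1.0).sum())   # toy reward: flat voltage


def soft_update(target, source, tau):
    # Line 14: theta' <- tau * theta + (1 - tau) * theta'
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * sp.data)


def train(env, s_dim, a_dim, episodes=50, steps=100,
          gamma=0.99, tau=0.005, batch=64):
    actor, critic = Actor(s_dim, a_dim), Critic(s_dim, a_dim)        # line 1
    actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)  # line 2
    buf = deque(maxlen=100_000)                                      # line 3
    opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)
    opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)

    for _ in range(episodes):                                        # line 4
        s = env.reset()                                              # line 5
        for _ in range(steps):                                       # line 6
            with torch.no_grad():                                    # line 7
                a = actor(torch.as_tensor(s, dtype=torch.float32))
                a = (a + 0.1 * torch.randn(a_dim)).clamp(-1, 1).numpy()
            s2, r = env.step(a)                                      # line 8
            buf.append((s, a, r, s2))                                # line 9
            s = s2
            if len(buf) < batch:
                continue
            # Line 10: sample a random minibatch of N transitions.
            S, A, Rb, S2 = map(np.array, zip(*random.sample(buf, batch)))
            S, A, S2 = (torch.as_tensor(x, dtype=torch.float32)
                        for x in (S, A, S2))
            Rb = torch.as_tensor(Rb, dtype=torch.float32).unsqueeze(-1)
            with torch.no_grad():    # line 11: y_i, standing in for Eq. (26)
                y = Rb + gamma * critic_t(S2, actor_t(S2))
            loss_c = nn.functional.mse_loss(critic(S, A), y)    # cf. Eq. (27)
            opt_c.zero_grad(); loss_c.backward(); opt_c.step()       # line 12
            loss_a = -critic(S, actor(S)).mean()                # cf. Eq. (28)
            opt_a.zero_grad(); loss_a.backward(); opt_a.step()       # line 13
            soft_update(critic_t, critic, tau)                       # line 14
            soft_update(actor_t, actor, tau)


if __name__ == "__main__":
    train(PandapowerEnv(), s_dim=9, a_dim=2, episodes=5, steps=20)
```

In this sketch the target networks of line 2 exist only to compute the bootstrapped target $y_i$ in line 11; the soft update of line 14 moves them toward the learned networks at rate $\tau \ll 1$, which is what keeps the critic's regression target slowly varying and the training stable.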