Algorithm 1: DDPG
1. Initialize: the Critic network $Q(s,a|\theta^{Q})$ and the Actor network $\mu(s|\theta^{\mu})$ with weights $\theta^{Q}$ and $\theta^{\mu}$.
The Critic target network $Q'(s,a|\theta^{Q'})$ and the Actor target network $\mu'(s|\theta^{\mu'})$ have weights $\theta^{Q'}\leftarrow\theta^{Q}$ and $\theta^{\mu'}\leftarrow\theta^{\mu}$.
The experience replay buffer $R$ has size $n$. Empty the experience replay buffer $R$.
2. for episode = 1, 2, …, T do
3. Reset the simulation parameters of the energy dispatch system to obtain the initial observation state $s_1$.
4.     for i = 1, 2, …, I do
5.     Normalize the state $s_i$ to obtain the normalized state $\bar{s}_i$.
6.     Obtain the Actor network action $a_i$ with exploration noise $n_i$:
                                                        $a_i=\min\big(\max\big(\mu(s_i|\theta^{\mu})+n_i,\,-1\big),\,1\big)$
7.     Execute action $a_i$, obtain the reward $r_i$, and observe the new state $s_{i+1}$.
8.     Store the transition $(s_i,a_i,r_i,s_{i+1})$ in the replay buffer $R$.
9.     Sample a minibatch of $N$ transitions $(s_j,a_j,r_j,s_{j+1})$ from $R$, $j=1,2,\ldots,N$.
10.     Calculate the target value $Q_{\mathrm{target},j}=r_j+\gamma Q'\big(s_{j+1},\mu'(s_{j+1}|\theta^{\mu'})\,\big|\,\theta^{Q'}\big)$.
11.     Update the Critic network parameters $\theta^{Q}$ by minimizing the mean-squared
    loss function:
                                                        $L(\theta^{Q})=\frac{1}{N}\sum_{j=1}^{N}\big(Q_{\mathrm{target},j}-Q(s_j,a_j|\theta^{Q})\big)^{2}$.
12.     Update the Actor network using the sampled policy gradient:
                                                        $\nabla_{\theta^{\mu}}J\approx\frac{1}{N}\sum_{j}\nabla_{a}Q(s,a|\theta^{Q})\big|_{s=s_j,\,a=\mu(s_j)}\,\nabla_{\theta^{\mu}}\mu(s|\theta^{\mu})\big|_{s_j}$.
13.     Update the target network parameters with soft updates:
                                                        $\theta^{Q'}\leftarrow\tau\theta^{Q}+(1-\tau)\theta^{Q'}$,
                                                        $\theta^{\mu'}\leftarrow\tau\theta^{\mu}+(1-\tau)\theta^{\mu'}$.
14.     end for
15. end for
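For illustration, a minimal sketch of Algorithm 1 in PyTorch follows. The network sizes, learning rates, noise scale, and state/action dimensions are placeholder assumptions rather than the authors' implementation, and the sketch omits the state normalization and the environment-specific reward of the energy dispatch system.

```python
# A minimal DDPG sketch following Algorithm 1. All hyperparameters and
# dimensions below are illustrative assumptions, not the paper's settings.
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2            # placeholder dimensions (assumption)
GAMMA, TAU, BATCH_SIZE = 0.99, 0.005, 64

class Actor(nn.Module):
    """Deterministic policy mu(s | theta_mu); tanh keeps actions in [-1, 1]."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64), nn.ReLU(),
            nn.Linear(64, ACTION_DIM), nn.Tanh())

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Action-value function Q(s, a | theta_Q)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
            nn.Linear(64, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

# Step 1: networks, target copies, and the replay buffer R
actor, critic = Actor(), Critic()
actor_target, critic_target = Actor(), Critic()
actor_target.load_state_dict(actor.state_dict())    # theta_mu' <- theta_mu
critic_target.load_state_dict(critic.state_dict())  # theta_Q'  <- theta_Q
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=100_000)                # transitions stored as tensors

def select_action(s, noise_std=0.1):
    """Step 6: a_i = clip(mu(s_i | theta_mu) + n_i, -1, 1)."""
    with torch.no_grad():
        a = actor(s) + noise_std * torch.randn(ACTION_DIM)
    return a.clamp(-1.0, 1.0)

def soft_update(target, source):
    """Step 13: theta' <- tau * theta + (1 - tau) * theta'."""
    for p_t, p in zip(target.parameters(), source.parameters()):
        p_t.data.mul_(1.0 - TAU).add_(TAU * p.data)

def update():
    """Steps 9-13: sample a minibatch and update the Critic, Actor, and targets."""
    if len(replay_buffer) < BATCH_SIZE:
        return
    batch = random.sample(list(replay_buffer), BATCH_SIZE)
    s, a, r, s_next = (torch.stack(x) for x in zip(*batch))

    # Step 10: Q_target,j = r_j + gamma * Q'(s_{j+1}, mu'(s_{j+1} | theta_mu') | theta_Q')
    with torch.no_grad():
        q_target = r.unsqueeze(-1) + GAMMA * critic_target(s_next, actor_target(s_next))

    # Step 11: minimize the mean-squared loss L(theta_Q)
    critic_loss = nn.functional.mse_loss(critic(s, a), q_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Step 12: sampled policy gradient -- maximize Q(s, mu(s)) w.r.t. theta_mu
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Step 13: soft-update the target network parameters
    soft_update(critic_target, critic)
    soft_update(actor_target, actor)
```

In the outer training loop (steps 2-14), each environment interaction would store the tensors $(s_i,a_i,r_i,s_{i+1})$ in replay_buffer and then call update() once per step.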