2025 Aug 4;25(15):4802. doi: 10.3390/s25154802
Algorithm 1 Deep deterministic policy gradient algorithm.

Input: states and reward s_t, r_t, s_{t+1}

Output: action a_t (Δt_i)

1. For each bus, for t = 1 to T:

2. Initialize the test system parameters, including the network parameters and the reward and penalty functions.

3. Combining the adjustment constraints with a random exploration factor, the model determines the action a_t, a_t ∈ A, and passes the action to the simulation part.

4. The simulation part executes action a_t and returns the reward value r_t and the new state s_{t+1}.

5. If the sample pool overflows, delete the earliest sample records in chronological order.

6. The actor network stores the tuple (s_t, a_t, r_t, s_{t+1}) in the experience replay buffer, which supplies the training data for the online networks.

7. Sample N tuples (s_t, a_t, r_t, s_{t+1}) from the experience pool as the training set for the online actor network and Q network.

8. Use the standard backpropagation (BP) method to compute the gradient of the online Q network.

9. Update the parameters θ_e of the online Q network.

10. Calculate the policy gradient (PG) of the actor network.

11. Update the parameters θ_n of the online actor network.

12. Update the parameters θ_n′, θ_e′ of the target networks.

13. End for.
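The loop above can be sketched in code. The following is a minimal, framework-free illustration of the DDPG bookkeeping — the bounded experience pool with oldest-first eviction (steps 5–6), mini-batch sampling (step 7), and the soft target-network update (step 12). The environment, the scalar "network" parameters, and names such as `ReplayBuffer`, `capacity`, and `tau` are illustrative assumptions, not from the paper; the actual gradient computations of steps 8–11 would require neural networks and backpropagation and are reduced here to a placeholder update.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience pool (steps 5-7): fixed capacity, earliest records evicted first."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)  # deque drops the oldest entry when full

    def store(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, n):
        # Random mini-batch; cap at current pool size while the pool is filling.
        return random.sample(self.buf, min(n, len(self.buf)))

def soft_update(target, online, tau=0.005):
    """Step 12: target parameters slowly track the online ones,
    theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * o + (1 - tau) * t for o, t in zip(online, target)]

# Toy walk through steps 1-12 with scalar stand-ins for the networks.
buffer = ReplayBuffer(capacity=1000)
theta_online = [0.0, 0.0]          # stand-in for online network parameters
theta_target = list(theta_online)  # target network starts as a copy

s = 0.0
for t in range(200):
    a = theta_online[0] * s + random.gauss(0.0, 0.1)  # step 3: policy + exploration noise
    s_next = s + a                                    # step 4: toy transition
    r = -abs(s_next)                                  # step 4: toy reward
    buffer.store(s, a, r, s_next)                     # steps 5-6
    batch = buffer.sample(32)                         # step 7
    # Steps 8-11 (placeholder: nudge the parameter by the mean batch reward;
    # a real implementation backpropagates through the Q and actor networks):
    theta_online[0] += 1e-3 * sum(b[2] for b in batch) / len(batch)
    theta_target = soft_update(theta_target, theta_online)  # step 12
    s = s_next
```

Using a `deque` with `maxlen` implements step 5 directly: once the pool is full, appending a new record silently discards the earliest one, so no explicit overflow check is needed.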