Algorithm 1 Deep deterministic policy gradient algorithm.
|
Input: status and reward
Output: action

1. For each bus, from to :
2. Initialize the test-system parameters, including the network parameters and the reward and penalty functions.
3. Combining the adjustment constraints with a random exploration factor, the model determines the action and passes it to the simulation part.
4. The simulation part executes the action and returns the reward value and the new status.
5. If the sample pool overflows, delete the earliest sample records in chronological order.
6. The actor network places the sample into the experience replay buffer, which supplies the training data for the online networks.
7. Sample N sets of data from the experience pool as the training set for the online actor network and the online Q network.
8. Use the standard backpropagation method to calculate the gradient of the online Q network.
9. Update the parameters of the online Q network.
10. Calculate the policy gradient (PG) of the actor network.
11. Update the parameters of the online actor network.
12. Update the parameters of the target networks.
13. End for.
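The loop above can be sketched as a minimal, self-contained training script. This is not the paper's implementation: the actor and Q networks are replaced with linear stand-ins, the power-system simulation with a toy transition function, and all dimensions, learning rates, and the TD-error clipping are illustrative assumptions chosen so the sketch runs without a deep-learning framework.

```python
import random
from collections import deque

import numpy as np

# All dimensions, rates, and the environment below are illustrative
# assumptions; they are not taken from the paper.
STATE_DIM, ACTION_DIM = 4, 2
BUFFER_CAP, BATCH_N = 500, 32
GAMMA, TAU, LR = 0.99, 0.01, 1e-3

rng = np.random.default_rng(0)

# Linear stand-ins for the online actor and Q (critic) networks.
actor_w = rng.normal(scale=0.1, size=(ACTION_DIM, STATE_DIM))
critic_w = rng.normal(scale=0.1, size=STATE_DIM + ACTION_DIM)
target_actor_w = actor_w.copy()
target_critic_w = critic_w.copy()

# Step 5: a bounded deque drops the oldest samples automatically.
replay = deque(maxlen=BUFFER_CAP)


def act(state):
    # Step 3: deterministic policy output plus a random exploration factor.
    return actor_w @ state + 0.1 * rng.normal(size=ACTION_DIM)


def q_value(w, state, action):
    return float(w @ np.concatenate([state, action]))


def env_step(state, action):
    # Toy stand-in for the power-system simulation (step 4).
    nxt = 0.9 * state + 0.1 * np.tanh(action).mean()
    return nxt, -float(np.sum(nxt ** 2))


state = rng.normal(size=STATE_DIM)
for t in range(150):
    action = act(state)                          # step 3
    nxt, reward = env_step(state, action)        # step 4
    replay.append((state, action, reward, nxt))  # step 6
    state = nxt
    if len(replay) < BATCH_N:
        continue
    for s, a, r, s2 in random.sample(list(replay), BATCH_N):  # step 7
        a2 = target_actor_w @ s2
        y = r + GAMMA * q_value(target_critic_w, s2, a2)  # TD target
        td_err = np.clip(q_value(critic_w, s, a) - y, -1.0, 1.0)
        # Steps 8-9: gradient step on the online Q network.
        critic_w -= LR * td_err * np.concatenate([s, a])
        # Steps 10-11: deterministic policy gradient; for a linear critic,
        # dQ/da is simply the action part of the critic weights.
        actor_w += LR * np.outer(critic_w[STATE_DIM:], s)
    # Step 12: soft update of the target networks.
    target_critic_w = (1 - TAU) * target_critic_w + TAU * critic_w
    target_actor_w = (1 - TAU) * target_actor_w + TAU * actor_w
```

The soft update in step 12 (blending a small fraction TAU of the online weights into the target weights each iteration) is what stabilizes the TD target; without it the Q target would chase its own updates.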