Sensors. 2025 Nov 14;25(22):6956. doi: 10.3390/s25226956
Algorithm 1 Training process of the improved D3QN

1: Input: total iteration rounds T, discount factor γ, and exploration rate ε.
2: Initialize the evaluation network parameters θ, the target network parameters θ′ = θ, the target-update frequency x, and the replay buffer D;
3: for t = 1 to T do
4:     With probability ε, randomly select an action from the optimized exploration space e; otherwise, select the action given by the greedy policy;
5:     Execute action a_t; the agent observes the new state s_{t+1} and reward R_t;
6:     Store the transition (s_t, a_t, R_t, s_{t+1}) in D;
7:     Randomly sample a batch of transitions (s_t, a_t, R_t, s_{t+1}) from D and compute the Q-values of the evaluation network;
8:     Compute the loss and update the parameters θ;
9:     Every x steps, reset θ′ = θ, update the target Q-values, and decay ε;
10:    s_t = s_{t+1};
11: end for
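The training loop above can be sketched in code. The following is a minimal, hypothetical toy setup, not the paper's implementation: a random 4-state environment and a linear dueling Q-network stand in for the real task and network, and names such as `DuelingQ`, `sync_every`, and `step_env` are illustrative assumptions. The step comments map back to the numbered lines of Algorithm 1.

```python
import random
from collections import deque

import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 4, 3  # toy sizes, not from the paper


class DuelingQ:
    """Linear dueling head: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)."""

    def __init__(self):
        self.Wv = rng.normal(0, 0.1, (N_STATES, 1))          # value stream
        self.Wa = rng.normal(0, 0.1, (N_STATES, N_ACTIONS))  # advantage stream

    def q(self, s_onehot):
        v = s_onehot @ self.Wv                 # scalar state value, shape (1,)
        a = s_onehot @ self.Wa                 # advantages, shape (N_ACTIONS,)
        return v + a - a.mean()                # dueling aggregation

    def copy_from(self, other):
        self.Wv, self.Wa = other.Wv.copy(), other.Wa.copy()


def onehot(s):
    v = np.zeros(N_STATES)
    v[s] = 1.0
    return v


def step_env(s, a):
    """Toy transition: random next state; reward favors action 0."""
    return int(rng.integers(N_STATES)), (1.0 if a == 0 else 0.0)


eval_net, target_net = DuelingQ(), DuelingQ()
target_net.copy_from(eval_net)                 # step 2: theta' = theta
D = deque(maxlen=500)                          # step 2: replay buffer D
gamma, eps, lr = 0.9, 1.0, 0.05
sync_every, batch = 20, 16                     # x and a mini-batch size

s = int(rng.integers(N_STATES))
for t in range(1, 201):                        # step 3: for t = 1..T
    # step 4: epsilon-greedy (here over the full action space)
    if rng.random() < eps:
        a = int(rng.integers(N_ACTIONS))
    else:
        a = int(np.argmax(eval_net.q(onehot(s))))
    s_next, r = step_env(s, a)                 # step 5: observe s_{t+1}, R_t
    D.append((s, a, r, s_next))                # step 6: store the transition
    if len(D) >= batch:                        # step 7: sample a mini-batch
        for (si, ai, ri, sj) in random.sample(list(D), batch):
            # Double-DQN target: eval net picks a*, target net scores it
            a_star = int(np.argmax(eval_net.q(onehot(sj))))
            y = ri + gamma * target_net.q(onehot(sj))[a_star]
            # step 8: squared-error loss on Q(si, ai); one gradient step
            grad = eval_net.q(onehot(si))[ai] - y
            eval_net.Wv[si, 0] -= lr * grad
            eval_net.Wa[si, ai] -= lr * grad * (1 - 1 / N_ACTIONS)
    if t % sync_every == 0:                    # step 9: every x steps
        target_net.copy_from(eval_net)         # reset theta' = theta
        eps = max(0.05, eps * 0.9)             # decay epsilon
    s = s_next                                 # step 10: s_t = s_{t+1}
```

The Double-DQN target (evaluation network selects the action, target network evaluates it) and the dueling aggregation in `DuelingQ.q` are the two ingredients that distinguish D3QN from a plain DQN loop; the paper's improvement additionally restricts step 4 to an optimized exploration space e, which this sketch does not model.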