Sensors. 2021 Jul 2;21(13):4560. doi: 10.3390/s21134560
Algorithm 1 Learning Algorithm
1:  Initialize the policy and value function parameters θ_π, θ_V
2:  Set the maximum episode N and maximum step T
3:  repeat
4:    for i in [0, N−1] do
5:      Randomly initialize the state s_0 of the vehicle
6:      for t in [0, T−1] do
7:        Make a decision a_t according to π(a|s_t)
8:        Evaluate s_{t+1} according to (18)
9:        Collect D_t = {s_t, a_t, π(a_t|s_t), r_t, s_{t+1}}
10:     Save the trajectory τ_i = {D_0, D_1, …, D_{T−1}} to the memory buffer B
11:     Randomly sample M trajectories from B
12:     for each sampled trajectory τ do
13:       Set tmp = 0
14:       for j = T−1 to 0 do
15:         Q_j = r_j + γ·V̂(s_{j+1})
16:         A_j = Q_j − V̂(s_j)
17:         A_j^trace = A_j + γ·tmp
18:         tmp = min(1, π(a_j|s_j; θ) / π(a_j|s_j; θ_old)) · A_j^trace
19:         V_j^trace = V̂(s_j) + tmp
20:       ĝ_policy = (1/T) Σ_{j=0}^{T−1} ∇L_policy,j
21:       Update θ_π with the Adam optimizer using ĝ_policy
22:       ĝ_value = (1/T) Σ_{j=0}^{T−1} ∇L_value,j
23:       Update θ_V with the Adam optimizer using ĝ_value
24: until training success
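As a concrete illustration of steps 11–23, the following is a minimal PyTorch sketch of the backward trace recursion and the two Adam updates. It assumes a discrete action space and simple MLP networks; PolicyNet, ValueNet, process_trajectory, update, and the PPO-style clipped surrogate used for L_policy (and squared error for L_value) are illustrative assumptions rather than the authors' implementation, and the state transition of Equation (18) is left to the environment that fills the buffer.

```python
import random

import torch
import torch.nn as nn

# Hypothetical network definitions; the paper's actual architectures are not
# reproduced here. Any MLP producing action logits / a scalar value would do.
class PolicyNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))

    def forward(self, s):
        # pi(a|s): categorical distribution over discrete actions
        return torch.distributions.Categorical(logits=self.body(s))


class ValueNet(nn.Module):
    def __init__(self, obs_dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, s):
        return self.body(s).squeeze(-1)  # V_hat(s)


def process_trajectory(traj, policy_net, value_net, gamma=0.99):
    """Steps 13-19: backward recursion producing A^trace and V^trace.

    `traj` is a list of dicts {s, a, logp, r, s_next} with tensor states,
    matching the tuple D_t collected in step 9."""
    states = torch.stack([d["s"] for d in traj])
    next_states = torch.stack([d["s_next"] for d in traj])
    actions = torch.tensor([d["a"] for d in traj])
    rewards = torch.tensor([d["r"] for d in traj], dtype=torch.float32)
    old_logp = torch.tensor([d["logp"] for d in traj], dtype=torch.float32)

    with torch.no_grad():
        v = value_net(states)                  # V_hat(s_j)
        v_next = value_net(next_states)        # V_hat(s_{j+1})
        logp = policy_net(states).log_prob(actions)
        ratio = torch.exp(logp - old_logp)     # pi(a|s;theta) / pi(a|s;theta_old)
        clipped = torch.clamp(ratio, max=1.0)  # min(1, ratio): truncated IS weight

        T = len(traj)
        a_trace = torch.zeros(T)
        v_trace = torch.zeros(T)
        tmp = 0.0
        for j in reversed(range(T)):                  # step 14: j = T-1, ..., 0
            q_j = rewards[j] + gamma * v_next[j]      # step 15
            a_j = q_j - v[j]                          # step 16
            a_trace[j] = a_j + gamma * tmp            # step 17
            tmp = clipped[j] * a_trace[j]             # step 18
            v_trace[j] = v[j] + tmp                   # step 19
    return states, actions, old_logp, a_trace, v_trace


def update(buffer, policy_net, value_net, pi_opt, v_opt, M,
           gamma=0.99, clip_eps=0.2):
    """Steps 11-23: sample M trajectories and take one Adam step per network."""
    for traj in random.sample(buffer, M):             # step 11
        s, a, old_logp, a_tr, v_tr = process_trajectory(
            traj, policy_net, value_net, gamma)

        # Steps 20-21: policy surrogate loss and Adam update of theta_pi.
        logp = policy_net(s).log_prob(a)
        ratio = torch.exp(logp - old_logp)
        l_policy = -torch.min(
            ratio * a_tr,
            torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * a_tr).mean()
        pi_opt.zero_grad()
        l_policy.backward()
        pi_opt.step()

        # Steps 22-23: regress V_hat toward the traced target V^trace (theta_V).
        l_value = ((value_net(s) - v_tr) ** 2).mean()
        v_opt.zero_grad()
        l_value.backward()
        v_opt.step()
```

In use, the optimizers would be created once, e.g. pi_opt = torch.optim.Adam(policy_net.parameters(), lr=3e-4) and v_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3), and the outer rollout loops of steps 3–10 would fill `buffer` with trajectories before each call to update().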