Sensors. 2019 Feb 24;19(4):960. doi: 10.3390/s19040960
Algorithm A1 Double Q-learning with CNN.
1:  Initialize reward function Q with weights θ
2:  Initialize target reward function Q^ with weights θ^
3:  Initialize replay memory D
4:  for episode = 1, M do
5:      Initialize states ϕ1 = (s1, n1, m1)
6:      for t = 1, T do
7:          Select action at using an ϵ-greedy policy based on Q
8:          Execute action at, obtain reward rt and state st+1
9:          Update states ϕt+1 = (st+1, nt+1, mt+1)
10:         Store transition {ϕt, at, rt, ϕt+1} in D
11:         Sample a random minibatch of transitions {ϕj, aj, rj, ϕj+1} from D
12:         Set yj = rj if the episode terminates at step j+1; otherwise yj = rj + γQ^(ϕj+1, argmaxa Q(ϕj+1, a))
13:         Perform a gradient descent step on HuberLoss(yj, Q(ϕj, aj; θ)) with respect to θ
14:         Every C steps reset Q^ = Q
15:         Terminate the episode if the source is found
16:     end for
17: end for
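
For concreteness, the following is a minimal PyTorch sketch of the core updates in Algorithm A1: ϵ-greedy action selection (step 7), the Double Q-learning target yj with a Huber-loss gradient step (steps 11–13), and the periodic target reset (step 14). The CNN architecture, the tensor encoding of the states ϕ = (s, n, m), and all hyperparameters (γ, ϵ, minibatch size, C) are illustrative assumptions, not the paper's actual implementation.

    # Minimal Double Q-learning sketch; network and shapes are placeholders.
    import random

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class QNetwork(nn.Module):
        # Placeholder CNN standing in for the paper's Q network (steps 1-2).
        def __init__(self, in_channels: int, n_actions: int):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            self.head = nn.Linear(32, n_actions)

        def forward(self, x):
            return self.head(self.features(x))


    def select_action(q_net, phi, n_actions, epsilon):
        # Step 7: epsilon-greedy action selection based on Q.
        if random.random() < epsilon:
            return random.randrange(n_actions)
        with torch.no_grad():
            return int(q_net(phi.unsqueeze(0)).argmax(dim=1).item())


    def double_q_step(q_net, target_net, optimizer, batch, gamma):
        # Steps 11-13: Double Q-learning target yj and one Huber-loss gradient step.
        phi, a, r, phi_next, done = batch  # stacked tensors from the sampled minibatch

        with torch.no_grad():
            # The greedy action is chosen by the online network Q ...
            next_a = q_net(phi_next).argmax(dim=1, keepdim=True)
            # ... while its value is read from the target network Q^.
            next_q = target_net(phi_next).gather(1, next_a).squeeze(1)

        # yj = rj on terminal transitions, otherwise rj + gamma * Q^(phi_{j+1}, argmax_a Q(phi_{j+1}, a)).
        y = r + gamma * next_q * (1.0 - done)

        q_sa = q_net(phi).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = F.smooth_l1_loss(q_sa, y)  # Huber loss of step 13

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()


    def sync_target(q_net, target_net):
        # Step 14: every C steps reset Q^ = Q.
        target_net.load_state_dict(q_net.state_dict())

In this sketch the replay memory D (steps 3, 10, 11) would typically be a bounded buffer, e.g. a collections.deque with a fixed maxlen, sampled uniformly to form the minibatch; its capacity and the minibatch size are not specified in the listing above.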