2020 Apr 21;20(8):2361. doi: 10.3390/s20082361
Algorithm 2: Pseudocode for distributed Q-learning
Initialization:
 for each $s_u^t \in S_u^t$, $a_u^t \in A_u^t$ do
  initialize Q-table and policy $\pi_u(s_u^t)$
 end for
Learning:
 loop
  estimate state $s_u^t$
  generate a random real number $x \in [0, 1]$
  if $x < \varepsilon$ // for exploration
   select action $a_u^t$ randomly
  else
   select action $a_u^t$ according to $\pi_u(s_u^t)$
  end if
  receive action $a_{BBU}^t$ from Algorithm 1
  determine action $a_u$ by comparing $a_u^t$ and $a_{BBU}^t$
  execute action $a_u^t$
  calculate reward $r_u^t$
  update Q-value $Q(s_u^t, a_u^t)$ and $\pi_u(s_u^t)$
 end loop
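
The loop in Algorithm 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `UEAgent`, the helpers `resolve` and `step`, and the parameters `alpha` (learning rate) and `gamma` (discount factor) are hypothetical names introduced here. The pseudocode does not spell out the update rule, so the standard tabular Q-learning update is assumed, with the policy taken as greedy with respect to the Q-table. The rule for comparing the local action with the action received from Algorithm 1 is also not given in the pseudocode, so `resolve` is only a placeholder.

```python
import random
from collections import defaultdict


class UEAgent:
    """Tabular epsilon-greedy Q-learning agent for one user u (illustrative)."""

    def __init__(self, actions, epsilon=0.1, alpha=0.5, gamma=0.9, seed=None):
        self.actions = list(actions)
        self.epsilon = epsilon  # exploration probability
        self.alpha = alpha      # learning rate (assumed, not from the source)
        self.gamma = gamma      # discount factor (assumed, not from the source)
        self.rng = random.Random(seed)
        # Initialization: every Q(s, a) starts at 0, created lazily on access.
        self.q = defaultdict(float)

    def policy(self, state):
        # Assumed greedy policy pi_u(s): the action with the largest Q-value.
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def select_action(self, state):
        x = self.rng.random()  # random real number in [0, 1)
        if x < self.epsilon:   # exploration
            return self.rng.choice(self.actions)
        return self.policy(state)  # exploitation

    def update(self, state, action, reward, next_state):
        # Standard Q-learning update (assumption; the pseudocode only says
        # "update Q-value").
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])


def resolve(local_action, bbu_action):
    # PLACEHOLDER: the pseudocode does not state how the two actions are
    # compared. Here the BBU action wins whenever one is provided.
    return local_action if bbu_action is None else bbu_action


def step(agent, state, bbu_action, env_step):
    """One pass through the learning loop; env_step is a hypothetical
    callable returning (reward, next_state)."""
    local_action = agent.select_action(state)
    executed = resolve(local_action, bbu_action)
    reward, next_state = env_step(state, executed)
    agent.update(state, executed, reward, next_state)
    return executed, reward, next_state
```

Because the policy is derived from the Q-table on demand, updating the Q-value also updates the policy, which mirrors the last step of the loop. A real deployment would replace `env_step` with the actual state estimation and reward calculation, and `resolve` with the comparison rule used together with Algorithm 1.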