Algorithm 2 Distributed Q-learning for device m with Boltzmann learning policy.
for each of the N actions do
    Initialize the Q-value and the action selection probability for this action.
end for
for each learning iteration do
    if the stopping criterion is met then
        exit.
    end if
    Choose the action according to the current action selection probability vector.
    Based on the chosen action, change the computing state and sub-channel.
    Observe the sub-channel interference and calculate the reward according to (26).
    for each of the N actions do
        Calculate the Q-value according to (25), with the learning rate formulated in (33).
        Update the action selection probability vector according to (31).
    end for
end for
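The per-device loop above can be sketched in Python. The Boltzmann (softmax) action selection is standard; the reward model, the 1/t learning-rate decay, and all parameter values below are illustrative placeholders for the paper's equations (25), (26), (31), and (33), not the paper's actual formulas.

```python
import math
import random

def boltzmann_probs(q_values, temperature):
    """Softmax (Boltzmann) action-selection probabilities over Q-values."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def run_device(num_actions=4, num_iters=500, temperature=0.5, seed=0):
    """One device's Q-learning loop with Boltzmann exploration.

    The noisy reward (best action = 2) and the 1/t learning rate are
    illustrative stand-ins for the paper's equations (26) and (33).
    """
    rng = random.Random(seed)
    q = [0.0] * num_actions
    for t in range(1, num_iters + 1):
        # Sample an action from the current probability vector (cf. (31)).
        probs = boltzmann_probs(q, temperature)
        action = rng.choices(range(num_actions), weights=probs)[0]
        # Placeholder reward: only action 2 pays off on average.
        reward = rng.gauss(1.0 if action == 2 else 0.0, 0.1)
        # Decaying learning rate, a stand-in for the schedule in (33).
        alpha = 1.0 / t
        # Stateless Q-update for the chosen action, a stand-in for (25).
        q[action] += alpha * (reward - q[action])
    return q
```

Rather than storing the probability vector explicitly, this sketch recomputes the softmax from the Q-values at every iteration, which has the same effect as the per-action probability update in the inner loop of Algorithm 2.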