Q-Learning Based Joint Energy-Spectral Efficiency Optimization in Multi-Hop Device-to-Device Communication

2020 Nov 23;20(22):6692. doi: 10.3390/s20226692

Algorithm 3 Proposed reinforcement learning algorithm

Initialize $Q(s, a) = 0$ for every state s in the set of states and every action a in the set of actions
while Battery lifetime is not equal to zero do
Determine current state
Select action a based on policy
$\frac{e^{Q(s, a) / ω}}{\sum_{a'} e^{Q(s, a') / ω}}$
Execute the selected action
Calculate the reward
$Rf = \begin{cases} \frac{LQM_{l}}{1 - LQM_{l}} + LQM_{l} \times A & \text{for } n \leq N \\ \frac{LQM_{l}}{1 - LQM_{l}} & \text{for } n > N \end{cases}$
Calculate the learning rate
$φ = \frac{Z}{\mathrm{visited}(s, a)}$
Calculate Q value for the executed action
$Q_{t+1}(s_{t}, a_{t}) = (1 - φ)\, Q_{t}(s_{t}, a_{t}) + φ \left( Rf(s_{t+1}) + Γ V_{t}(s_{t+1}) \right)$
Calculate the value function for the executed action
$V_{t+1}(s_{t}) = \max_{a \in A} Q_{t+1}(s_{t}, a)$
Update the utility table of the scheduler agent
$U(q) = (1 - Υ)\, U(q) + Υ \sum_{i} Rf_{i}$
Move to the next state based on the executed action
end while
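
The update steps of Algorithm 3 can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation. The function and argument names (`softmax_policy`, `reward`, `q_update`, `utility_update`, `n_max`, `a_const`) are hypothetical. ω is read here as the temperature of a Boltzmann (softmax) policy and Γ as the discount factor. Two details are assumptions that the listing does not state: the visit counter is incremented before φ is computed, so the first visit does not divide by zero, and the maximum Q value is subtracted inside the exponential for numerical stability, which leaves the probabilities unchanged.

```python
import math
from collections import defaultdict


def softmax_policy(q_row, omega):
    """Action probabilities exp(Q(s,a)/omega) / sum_a' exp(Q(s,a')/omega)."""
    m = max(q_row)  # assumption: shift by the max for numerical stability
    weights = [math.exp((q - m) / omega) for q in q_row]
    total = sum(weights)
    return [w / total for w in weights]


def reward(lqm, n, n_max, a_const):
    """Rf = LQM/(1 - LQM) + LQM * A for n <= N, else LQM/(1 - LQM)."""
    base = lqm / (1.0 - lqm)
    return base + lqm * a_const if n <= n_max else base


def q_update(Q, V, visits, s, a, s_next, rf, Z, gamma):
    """One tabular update with learning rate phi = Z / visited(s, a)."""
    visits[(s, a)] += 1  # assumption: count this visit before computing phi
    phi = Z / visits[(s, a)]
    Q[s][a] = (1 - phi) * Q[s][a] + phi * (rf + gamma * V[s_next])
    V[s] = max(Q[s])  # value function: best Q value in state s
    return phi


def utility_update(U, q, rewards, upsilon):
    """U(q) = (1 - upsilon) * U(q) + upsilon * sum_i Rf_i."""
    U[q] = (1 - upsilon) * U[q] + upsilon * sum(rewards)
    return U[q]


# Tiny illustrative run with two states and two actions (made-up numbers).
Q = {0: [0.0, 0.0], 1: [0.0, 0.0]}
V = {0: 0.0, 1: 0.0}
visits = defaultdict(int)
probs = softmax_policy(Q[0], omega=1.0)            # uniform while Q is all zero
rf = reward(lqm=0.5, n=1, n_max=3, a_const=2.0)    # bonus term applies (n <= N)
q_update(Q, V, visits, s=0, a=1, s_next=1, rf=rf, Z=1.0, gamma=0.9)
```

Because φ decays as 1 / visited(s, a), early experiences overwrite the table almost entirely while later ones only nudge it. Choosing Z ≤ 1 keeps φ within (0, 1] under the counting convention assumed here.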