Q-Learning Based Joint Energy-Spectral Efficiency Optimization in Multi-Hop Device-to-Device Communication

2020 Nov 23;20(22):6692. doi: 10.3390/s20226692

Algorithm 2 Next-hop selection algorithm at D2D device $i$

Input: Packet with destination RRH, $RRH_{d}$
Output: Best next hop toward $RRH_{d}$
Variables: RoutingTable, $j$ (selected next hop)
while battery lifetime > 0 do
Receive a packet destined for $RRH_{d}$
Select as next hop $j$ the neighbor on the path with the smallest path quality (PQ) value:
$PQ = \sum_{l=1}^{L} \frac{(LQM_{l})^{n_{H}}}{1 - LQM_{l}}$
Send the packet to $j$ with the selected level of $p_{dc}$
Receive feedback/reward $Rf$ from $j$:
$Rf = \begin{cases} \frac{LQM_{l}}{1 - LQM_{l}} + LQM_{l} \times A, & n \leq N, \\ \frac{LQM_{l}}{1 - LQM_{l}}, & n > N. \end{cases}$
Update the Q-value using the Q-learning rule
Update the corresponding entry in RRHTable
end while
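Algorithm 2 does not spell out the Q-value update or how PQ, $Rf$ and the table entries interact, so the following is only a minimal Python sketch under stated assumptions, not the authors' implementation. It assumes every $LQM_{l}$ lies in [0, 1) so that $1 - LQM_{l} > 0$; it treats table entries as costs (lower is better, matching the smallest-PQ selection rule) and seeds each candidate next hop with the PQ of the path through it; and it applies a cost-minimizing Q-learning step with an assumed learning rate `alpha`, discount `gamma`, and a hypothetical `hop_best` value (next hop $j$'s own best estimate toward the RRH, assumed to be piggybacked on its feedback). `RoutingTable`, `path_quality`, `reward` and all parameter names are hypothetical; $n$, $N$ and $A$ are kept as in the source, whose meanings this excerpt does not define.

```python
from dataclasses import dataclass, field


def path_quality(lqms: list[float], n_h: float) -> float:
    """PQ = sum over the path's links of (LQM_l)^{n_H} / (1 - LQM_l); smaller is better."""
    for lqm in lqms:
        if not 0.0 <= lqm < 1.0:
            raise ValueError("assumed LQM in [0, 1) so that 1 - LQM > 0")
    return sum(lqm ** n_h / (1.0 - lqm) for lqm in lqms)


def reward(lqm: float, n: int, big_n: int, a: float) -> float:
    """Piecewise feedback Rf returned by next hop j (n, N, A as in the source)."""
    if not 0.0 <= lqm < 1.0:
        raise ValueError("assumed LQM in [0, 1)")
    base = lqm / (1.0 - lqm)
    return base + lqm * a if n <= big_n else base


@dataclass
class RoutingTable:
    """Hypothetical per-device table: q[(destination RRH, next hop)] -> estimated cost."""
    alpha: float = 0.5  # learning rate (assumed value)
    gamma: float = 0.9  # discount factor (assumed value)
    q: dict[tuple[str, str], float] = field(default_factory=dict)

    def init_from_paths(self, dest: str, paths: dict[str, list[float]], n_h: float) -> None:
        # Seed each candidate next hop with the PQ of its path (an assumption).
        for hop, lqms in paths.items():
            self.q[(dest, hop)] = path_quality(lqms, n_h)

    def best_next_hop(self, dest: str) -> str:
        # Lowest entry wins, mirroring the "smallest PQ" selection rule.
        candidates = {hop: v for (d, hop), v in self.q.items() if d == dest}
        if not candidates:
            raise KeyError(f"no route to {dest}")
        return min(candidates, key=candidates.get)

    def update(self, dest: str, hop: str, rf: float, hop_best: float = 0.0) -> float:
        # Cost-minimizing Q-learning step: Q <- Q + alpha * (Rf + gamma * hop_best - Q).
        # hop_best = 0.0 models the case where j is the destination RRH itself.
        old = self.q[(dest, hop)]
        new = old + self.alpha * (rf + self.gamma * hop_best - old)
        self.q[(dest, hop)] = new
        return new


if __name__ == "__main__":
    table = RoutingTable()
    # Two hypothetical candidate paths to RRH1, given as per-link LQM values.
    table.init_from_paths("RRH1", {"j1": [0.2, 0.3], "j2": [0.1, 0.6]}, n_h=2)
    j = table.best_next_hop("RRH1")  # j1: PQ ~0.179 vs ~0.911 for j2
    rf = reward(0.2, n=1, big_n=3, a=0.5)  # n <= N branch: 0.25 + 0.1
    table.update("RRH1", j, rf)
    print(j, table.best_next_hop("RRH1"))
```

Keeping the table keyed by (destination, next hop) lets one device hold routes to several RRHs, and seeding entries with PQ means the first selection matches the PQ rule before any feedback has arrived.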