Optimal User Scheduling in Multi Antenna System Using Multi Agent Reinforcement Learning

2022 Oct 28;22(21):8278. doi: 10.3390/s22218278

Algorithm 2: Value iteration algorithm.

Initialize V arbitrarily

Repeat

$Δ \leftarrow 0$

For each $s \in S$

$v \leftarrow V(s)$

$V(s) \leftarrow \max_{a} \sum_{s', r} p(s', r \mid s, a) [r + γ V(s')]$

$Δ \leftarrow \max(Δ, |v - V(s)|)$

until $Δ < θ$, where $θ$ is a small positive number

Output a deterministic policy, $π$, such that

$π(s) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a) [r + γ V(s')]$
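The steps of Algorithm 2 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes a small tabular MDP in which the reward is a deterministic function of $(s, a, s')$, so the joint distribution $p(s', r \mid s, a)$ reduces to a transition array `P[s, a, s']` and a reward array `R[s, a, s']`. The function name `value_iteration`, the array layout, and the toy two-state MDP are all hypothetical choices made for the example.

```python
import numpy as np


def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Tabular value iteration (sketch under the assumptions stated above).

    P[s, a, s2]: probability of moving from state s to s2 under action a.
    R[s, a, s2]: reward received on that transition (assumed deterministic).
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)  # "Initialize V arbitrarily": zeros are one valid choice

    while True:
        delta = 0.0
        for s in range(n_states):
            v = V[s]
            # Expected return of each action: sum over s2 of p * (r + gamma * V[s2])
            q = np.sum(P[s] * (R[s] + gamma * V), axis=1)
            V[s] = q.max()
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break

    # Deterministic greedy policy with respect to the converged V
    Q = np.sum(P * (R + gamma * V), axis=2)
    policy = Q.argmax(axis=1)
    return V, policy


# Hypothetical two-state, two-action MDP used only to exercise the function.
P = np.zeros((2, 2, 2))
R = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0                    # state 0, action 0: stay in 0, reward 0
P[0, 1, 1] = 1.0; R[0, 1, 1] = 1.0  # state 0, action 1: move to 1, reward 1
P[1, 0, 1] = 1.0; R[1, 0, 1] = 2.0  # state 1, action 0: stay in 1, reward 2
P[1, 1, 0] = 1.0                    # state 1, action 1: move to 0, reward 0

V, policy = value_iteration(P, R, gamma=0.5)
print(V, policy)  # V is approximately [3, 4]; policy is [1, 0]
```

For this toy problem the fixed point can be checked by hand: staying in state 1 gives $V(1) = 2 + 0.5\,V(1) = 4$, and moving from state 0 to state 1 gives $V(0) = 1 + 0.5 \cdot 4 = 3$. The sketch updates `V` in place within each sweep, as in the pseudocode, rather than keeping a separate copy of the previous iterate.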