A Study on the Impact of Integrating Reinforcement Learning for Channel Prediction and Power Allocation Scheme in MISO-NOMA System

2023 Jan 26;23(3):1383. doi: 10.3390/s23031383

Algorithm 1: Developed Q-learning Channel Prediction Structure.

Inputs

2.
Number of iterations and the size of the channel parameters for every user device.
3.
Initial distance $d_{i}$ of every user device from the BS.
4.
Path loss parameter $ϑ$.
5.
Design random pilot symbols.
6.
Initialize the random channel parameters $h_{i j}$ for each user based on the fading model, with $j \in \{1, 2, \dots, N\}$ and $i \in \{1, 2, \dots, M\}$, where $N$ is the number of antennas at the BS and $M$ is the number of devices in the cell.
7.
Designate the power percentage $η_{i}$ for each user.
8.
Determine the system bandwidth $B$, total transmit power $P_{T}$, and noise spectral density $N_{o}$.
9.
Assign the desired channel parameters $h_{i d}$ and the target rate $R_{T}$.
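The channel initialization in step 6 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes Rayleigh fading (unit-variance complex Gaussian taps) scaled by a distance-based path loss $d_{i}^{-ϑ/2}$, since the fading model is not specified in this listing. The function names `init_channels` and `channel_gains` are hypothetical.

```python
import math
import random


def init_channels(distances, n_antennas, path_loss_exp, seed=None):
    """Draw an M x N list of complex channel taps h[i][j].

    Assumption: Rayleigh fading, i.e. each tap is complex Gaussian with
    unit variance, scaled by the path loss d_i ** (-path_loss_exp / 2).
    """
    rng = random.Random(seed)
    sigma = math.sqrt(0.5)  # per-component std so that E[|g|^2] = 1
    channels = []
    for d_i in distances:
        scale = d_i ** (-path_loss_exp / 2.0)
        row = [
            scale * complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
            for _ in range(n_antennas)
        ]
        channels.append(row)
    return channels


def channel_gains(channels):
    """Return the channel gains |h[i][j]|^2 for every user and antenna."""
    return [[abs(h) ** 2 for h in row] for row in channels]
```

Passing a fixed `seed` makes the random draw reproducible, which helps when comparing predicted taps against the initial ones.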

Procedure

10.
Based on the channel gain ${|h_{i j}|}^{2}$, the total transmit power $P_{T}$, and the initial power factor $η_{i}$ for each user, calculate the signal-to-interference-plus-noise ratio $S I N R_{i}$ and the minimum required rate $R_{i}$ for each device.
11.
At each iteration, compare the initial generated rate $R_{i}$ with the target rate $R_{T}$.
12.
Update the Q-table value that represents the current state-action pair $Q (s, a)$.
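The rate calculation in steps 10 and 11 can be illustrated with a short sketch. The exact SINR expression is not given in this listing, so the code assumes a textbook power-domain NOMA downlink with ideal successive interference cancellation (SIC), one scalar effective gain per user, and the Shannon rate $B \log_2(1 + S I N R_{i})$. The function name `noma_rates` and its parameters are hypothetical.

```python
import math


def noma_rates(gains, power_fractions, total_power, bandwidth, noise_density):
    """Return (sinrs, rates) per user under an assumed ideal-SIC NOMA model.

    Assumption: each user cancels the signals intended for users with
    weaker gains, so only the signals of users with strictly stronger
    gains remain as interference.
    """
    noise_power = noise_density * bandwidth
    sinrs, rates = [], []
    for i, g_i in enumerate(gains):
        # Residual interference: power allocated to stronger users,
        # received through this user's own channel gain.
        interference = sum(
            power_fractions[k] * total_power * g_i
            for k, g_k in enumerate(gains)
            if g_k > g_i
        )
        sinr = power_fractions[i] * total_power * g_i / (interference + noise_power)
        sinrs.append(sinr)
        rates.append(bandwidth * math.log2(1.0 + sinr))
    return sinrs, rates


def meets_target(rates, target_rate):
    """Step 11: compare each generated rate with the target rate."""
    return [r >= target_rate for r in rates]
```

In this model the weaker user is usually given the larger power fraction so that its signal can be decoded first.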

Q-algorithm

13.
Identify the discount factor $γ$, the learning rate $α$, the current state, and the terminal state.
14.
Choose the next state at random and set it as the next new state.
15.
Inspect all possible actions $a_{i}$ to move to the new state.
16.
Select the best action $a_{i} \in A$, i.e., the one that maximizes the Q-value function, $\arg\max_{a} Q (s, a)$, to move to the new state.
17.
Identify the immediate reward $R$, based on the action implemented to move to the new state.
18.
Based on (1) the maximum Q-value obtained in (16), (2) the corresponding reward $R$, and (3) the discount factor $γ$, the updated value $Q_{new} (s, a)$ is obtained from Bellman's equation, where $s'$ is the new state and $a'$ ranges over the actions available there:

$Q_{new} (s, a) \leftarrow R + γ \max_{a'} Q (s', a')$
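Steps 16 to 18 can be sketched with a tabular Q-function. The sketch assumes the Q-table is a list of rows indexed by state, with one entry per action, and reads the Bellman step in its standard form: reward plus the discounted maximum Q-value of the new state. The helper names `greedy_action` and `bellman_target` are hypothetical.

```python
def greedy_action(q_table, state):
    """Step 16: return the index of the action with the largest Q-value."""
    row = q_table[state]
    return max(range(len(row)), key=row.__getitem__)


def bellman_target(q_table, next_state, reward, gamma):
    """Step 18: reward plus the discounted best Q-value of the new state."""
    return reward + gamma * max(q_table[next_state])
```

For example, with `q_table = [[0.0, 1.0], [2.0, 0.5]]`, the greedy action in state 0 is action 1, because its Q-value 1.0 is the largest entry in that row.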

Outputs

19.
Based on the updated $Q (s, a)$ values in the Q-table, the channel coefficients $h_{i j}$ and the channel gain ${|h_{i j}|}^{2}$ can be updated, and a new user rate can be calculated and compared to the target rate $R_{T}$.
20.
Compute the difference $Δ Q$ between the updated value function $Q_{new} (s, a)$ and the previous $Q (s, a)$.
21.
Based on (20), the $Q (s, a)$ value in the Q-table can be further updated according to $Q (s, a) \leftarrow Q (s, a) + α \cdot Δ Q$.
22.
Check whether the terminal state has been reached or the episode has been completed.
23.
Compose the predicted channel taps $\hat{h}_{i}$.
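The correction in steps 20 and 21 amounts to moving the stored Q-value a fraction $α$ of the way toward the updated value. A minimal sketch, assuming the same list-of-rows Q-table layout and a hypothetical helper name `apply_delta`:

```python
def apply_delta(q_table, state, action, q_new, alpha):
    """Steps 20-21: compute Delta Q and blend it into the Q-table in place."""
    delta_q = q_new - q_table[state][action]    # step 20: Q_new(s, a) - Q(s, a)
    q_table[state][action] += alpha * delta_q   # step 21: Q <- Q + alpha * Delta Q
    return delta_q
```

With $α = 1$ the stored value is simply overwritten by the updated one, while smaller values of $α$ smooth the update across iterations.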