Re-Learning EXP3 Multi-Armed Bandit Algorithm for Enhancing the Massive IoT-LoRaWAN Network Performance

2022 Feb 18;22(4):1603. doi: 10.3390/s22041603

Algorithm 1 The M-EXP3 Algorithm with the Modification.

Parameters: η = γ \cdot α \in [0, 1], where α > 0 is a discount factor.
Initialization: w_{i}(1) = 1 for all i = 1, …, K.
For each time t = 1, 2, …:
Receive the experts’ advice vectors B_{i}.

Calculate, for each action i, the probability

P_{i} (t) = (1 - η) \sum_{j = 1}^{N} \frac{w_{i, j} (t) B_{i} (t)}{W_{t}} + \frac{η}{K}

(5)

Calculate the sum of the weights of the actions at time t:

W_{t} = \sum_{j = 1}^{K} w_{j} (t)

(6)

Choose action I_{t} according to the max distribution P_{i}(t).
Receive a profit for the action i:

g_{i} (t) \in [0, 1],

(7)

g^{*}_{i}(t) = \begin{cases} g_{i_{t}}(t) / P_{i_{t}}(t) & \text{if ACK}(r_{j}(i)) \text{ is received} \\ 0 & \text{otherwise} \end{cases}

(8)

Update B_{i}(t) as the reward (here the reward is a function of the expert in addition to the current action):

y_{i}(t) = B_{i}(t) \cdot g(t) = \sum_{i = 1}^{K} B_{i}(t) g^{*}_{i}(t)

(9)

Update the weight of each expert

w_{j}(t + 1) = w_{j}(t) \exp\left(\frac{η}{K} y_{i}(t)\right)

(10)
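One round of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes an EXP4-style reading of Equations (5)-(10): the advice is an N x K matrix whose row j is expert j's probability distribution over the K actions, there is one weight per expert, and the action is sampled from P(t). The algorithm's wording "max distribution" may instead mean selecting the action with the largest P_{i}(t). All function names, and the `gain_fn` callback that returns the gain and the ACK flag, are hypothetical.

```python
import math
import random


def action_probabilities(weights, advice, eta):
    """Eq. (5)-(6): mix the weighted expert advice with uniform exploration."""
    num_actions = len(advice[0])
    total_weight = sum(weights)  # W_t, Eq. (6)
    probs = []
    for i in range(num_actions):
        weighted_advice = sum(w * b[i] for w, b in zip(weights, advice))
        probs.append((1.0 - eta) * weighted_advice / total_weight
                     + eta / num_actions)
    return probs


def estimated_gains(probs, chosen, gain, ack_received):
    """Eq. (8): importance-weighted gain for the chosen action, 0 elsewhere."""
    est = [0.0] * len(probs)
    if ack_received:
        est[chosen] = gain / probs[chosen]
    return est


def update_weights(weights, advice, est, eta):
    """Eq. (9)-(10): score each expert, then apply the exponential update."""
    num_actions = len(est)
    new_weights = []
    for w, b in zip(weights, advice):
        y = sum(b_i * g_i for b_i, g_i in zip(b, est))  # Eq. (9)
        new_weights.append(w * math.exp(eta / num_actions * y))  # Eq. (10)
    return new_weights


def m_exp3_round(weights, advice, eta, gain_fn, rng=random):
    """Run one round; gain_fn(action) -> (gain in [0, 1], ack_received)."""
    probs = action_probabilities(weights, advice, eta)
    # Assumption: sample the action from P(t) rather than taking the argmax.
    chosen = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    gain, ack_received = gain_fn(chosen)
    est = estimated_gains(probs, chosen, gain, ack_received)
    return update_weights(weights, advice, est, eta), probs, chosen
```

For instance, with two experts that each recommend a different one of two actions, `m_exp3_round([1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], 0.2, gain_fn)` raises only the weight of the expert whose recommended action was played and acknowledged. When no ACK arrives, the estimated gain is zero and all weights stay unchanged.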