Two Tier Slicing Resource Allocation Algorithm Based on Deep Reinforcement Learning and Joint Bidding in Wireless Access Networks

2022 May 4;22(9):3495. doi: 10.3390/s22093495

Algorithm 1 DQN and Joint Bidding Algorithm for Upper Tier Bandwidth Allocation

1: Initialize the bidding pool B of the MVNOs and the corresponding lower-tier action selection table A;

2: Initialize the action-value function $Q$, the target action-value function $\hat{Q}$, and the replay memory D to capacity N;

3: Each MVNO $m \in M$ estimates the maximum and minimum total required rates of its linked users, then creates the bidding pool B;

4: For $b_{m}$ in B do

5: Find the lower-tier optimal allocation action and store it in table A;

6: End for

7: Randomly choose an action $a_{t}$, i.e., a bidding value $b_{m} \in B$, and the BS distributes $c_{m}$ to each MVNO according to (2);

8: Repeat

9: For t = 1 to T do

10: Calculate the ratio of the allocated bandwidth to its required minimum rate, and take it as the current state S = s from the last iteration;

11: For m = 1 to M do

12: Each MVNO m allocates the optimal bandwidth $c_{j}^{m}$ to its users according to table A;

13: Each MVNO m calculates $v_{m}$ by (9) and (10);

14: Each MVNO m calculates the penalty $q_{m}$ by (4);

15: Each MVNO m calculates the profit $y_{m}$ by (3) and gets the reward $r_{m}$;

16: End for

17: Calculate the total system utility F according to (5);

18: Calculate the total reward r;

19: Choose an action $a_{t}$, i.e., a bidding value $b_{m} \in B$, according to the policy of the DQN;

20: The InP distributes $c_{m}$ to each MVNO according to (2);

21: Get the state S = s' after the action selected in this iteration;

22: # Train DQN

23: The agent, i.e., each MVNO, inputs $(s, a, s', r)$ into the DQN;

24: The agent stores the transition $(s, a, s', r)$ in D;

25: The agent samples a random minibatch of transitions $(s_{i}, a_{i}, s'_{i}, r_{i})$ from D;

26: Set $y_{i} = \begin{cases} r_{i}, & \text{if the episode terminates at step } i + 1 \\ r_{i} + \gamma \max_{a^{*}} \hat{Q}(s'_{i}, a^{*}; \theta^{-}), & \text{otherwise} \end{cases}$

27: The agent performs a gradient descent step on $(y_{i} - Q(s_{i}, a_{i}; \theta))^{2}$ with respect to the network parameters $\theta$;

28: Every fixed number of steps, reset $\hat{Q} = Q$;

29: End for

30: Until the predefined maximum number of iterations has been completed.
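
The target and loss used in steps 26 and 27 can be illustrated with a small numerical sketch. This is not the authors' implementation: the function names `td_targets` and `mean_squared_td_error`, the sample minibatch, the discount factor of 0.5, and the choice to average the squared error over the minibatch are all assumptions made for illustration. The target network output is represented here by a plain list of values, one per candidate bidding value.

```python
from typing import List, Sequence


def td_targets(rewards: Sequence[float],
               next_q_values: Sequence[Sequence[float]],
               terminal: Sequence[bool],
               gamma: float) -> List[float]:
    """Compute one target per sampled transition, as in step 26.

    next_q_values[k] holds the target-network values for every action
    in the next state of transition k (hypothetical stand-in for Q-hat).
    """
    targets = []
    for r, q_next, done in zip(rewards, next_q_values, terminal):
        if done:
            # Episode ends at the next step: the target is the reward alone.
            targets.append(r)
        else:
            # Otherwise bootstrap from the best next action of the target network.
            targets.append(r + gamma * max(q_next))
    return targets


def mean_squared_td_error(targets: Sequence[float],
                          q_taken: Sequence[float]) -> float:
    """Squared error (y - Q(s, a))^2 of step 27, averaged over the minibatch.

    Averaging is an assumption; the algorithm only names the squared error.
    """
    return sum((y - q) ** 2 for y, q in zip(targets, q_taken)) / len(targets)


# Hypothetical minibatch: two transitions, three candidate bidding values each.
rewards = [1.0, 0.5]
next_q_values = [[0.2, 0.8, 0.4], [0.1, 0.3, 0.9]]
terminal = [False, True]

targets = td_targets(rewards, next_q_values, terminal, gamma=0.5)
loss = mean_squared_td_error(targets, q_taken=[1.0, 1.0])
```

In a full DQN the loss would be differentiated with respect to the network parameters, and the list standing in for the target network would be refreshed whenever step 28 copies Q into the target network.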