Algorithm 1 Discretized soft-actor-critic algorithm.
1: Initialize the experience buffer D
2: Initialize the weight with 1
3: Initialize the actor network with random parameters
4: Initialize the main critic networks with random parameters
5: Initialize the target critic networks with the same parameters as the main critic networks
6: for each training episode do
7:   Observe the initial state
8:   for each step do
9:     Generate the action
10:    Execute the action
11:    Observe the next state and the reward
12:    Store the experience in the experience buffer D
13:    Sample a mini-batch of experiences from the buffer D
14:    Calculate the target state value based on Equation (9)
15:    Update the main critic networks based on the gradient in Equation (10)
16:    Update the actor network based on the gradient in Equation (12)
17:    Update the weight based on the gradient in Equation (13)
18:    Every B steps, apply a soft update to the target critic networks based on Equation (14)
19:  end for
20: end for
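The control flow of Algorithm 1 can be sketched as a minimal Python skeleton. This is only a structural illustration: the environment, the action generator, and the parameter vectors below are toy placeholders, and the actual critic, actor, and weight updates of Equations (9)–(13) are not reproduced here (they are marked as stubs). Only the buffer handling and the periodic soft target update (Equation (14), assumed here to be a standard Polyak average with rate tau) are implemented concretely.

```python
import random
from collections import deque

def generate_action(state):
    # Step 9: placeholder policy; a real actor network would act here.
    return random.choice([0, 1])

def env_step(state, action):
    # Steps 10-11: toy environment returning (next_state, reward).
    return state + action, float(action)

def soft_update(target, main, tau=0.005):
    # Step 18 / Equation (14): assumed Polyak average of main params
    # into the target params at rate tau.
    return [tau * m + (1.0 - tau) * t for m, t in zip(main, target)]

class ReplayBuffer:
    # Steps 1, 12, 13: fixed-capacity experience buffer D.
    def __init__(self, capacity=10_000):
        self.data = deque(maxlen=capacity)
    def store(self, experience):
        self.data.append(experience)
    def sample(self, batch_size):
        return random.sample(self.data, min(batch_size, len(self.data)))

buffer = ReplayBuffer()
main_critic = [0.0, 0.0]            # toy parameter vector (step 4)
target_critic = list(main_critic)   # step 5: copy of main critic params
B = 4                               # target-update period

step_count = 0
for episode in range(3):            # step 6
    state = 0                       # step 7
    for _ in range(10):             # step 8
        action = generate_action(state)
        next_state, reward = env_step(state, action)
        buffer.store((state, action, reward, next_state))  # step 12
        batch = buffer.sample(32)   # step 13
        # Steps 14-17: the target-value computation and the critic,
        # actor, and weight gradient updates (Equations (9)-(13))
        # would go here; left as a stub in this sketch.
        step_count += 1
        if step_count % B == 0:     # step 18
            target_critic = soft_update(target_critic, main_critic)
        state = next_state
```

Keeping the target critics a slowly-moving copy of the main critics (rather than updating them every step) is what stabilizes the bootstrapped target in step 14.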