Sensors. 2022 Apr 22;22(9):3217. doi: 10.3390/s22093217
Algorithm 1 DDPG-based Optimization Framework for MEC-enabled Blockchain IoT Systems.
1:  for each GN m ∈ Φ_G do
2:      Initialization: replay memory B_m, critic network Q(s, a | θ_m^Q), actor network μ(s | θ_m^μ), and corresponding target networks Q′ and μ′ with weights θ_m^{μ′} ← θ_m^μ and θ_m^{Q′} ← θ_m^Q;
3:  end for
4:  for each episode in {1, 2, …, K_max} do
5:      Initialization: state s_{m,1} for each GN m ∈ Φ_G;
6:      for each decision epoch n = 1, 2, …, T_max do
7:          for each GN m ∈ Φ_G do
8:              Select action a_{m,n} = μ(s_{m,n} | θ_m^μ) + Δμ based on the exploration noise Δμ to decide the block interval and power allocation;
9:              Observe reward r_{m,n} and next state s_{m,n+1};
10:             Store the transition (s_{m,n}, a_{m,n}, r_{m,n}, s_{m,n+1}) into replay memory B_m;
11:             Sample a mini-batch of Z transition tuples {(s_z, a_z, r_z, s′_z)}_{z=1}^{Z} from memory B_m at random;
12:             Update the critic network by minimizing the loss
                    L = (1/Z) Σ_{z=1}^{Z} (Q(s_z, a_z | θ_m^Q) − ε_z)²;
13:             Update the actor policy based on the sampled policy gradient
                    ∇_{θ_m^μ} J ≈ (1/Z) Σ_{z=1}^{Z} ∇_a Q(s_z, a | θ_m^Q)|_{a=μ(s_z)} ∇_{θ_m^μ} μ(s_z | θ_m^μ);
14:             Update the target networks:
                    θ_m^{μ′} ← ζ θ_m^μ + (1 − ζ) θ_m^{μ′},
                    θ_m^{Q′} ← ζ θ_m^Q + (1 − ζ) θ_m^{Q′};
15:         end for
16:     end for
17: end for
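To make the per-GN training loop concrete, the sketch below implements the core DDPG mechanics of Algorithm 1 (replay memory, noisy action selection, critic regression toward the target value ε_z, the deterministic policy gradient, and the soft target update with rate ζ) for a single GN. It is a minimal illustration, not the paper's system model: the actor and critic are reduced to single linear layers, the environment, reward, state/action dimensions, and all hyperparameters are illustrative assumptions.

```python
import random
from collections import deque

import numpy as np

# Assumed toy dimensions: a 3-dim state and a 2-dim action
# (e.g. block interval and power allocation), not the paper's values.
STATE_DIM, ACTION_DIM = 3, 2
GAMMA, ZETA, NOISE_STD, LR = 0.99, 0.01, 0.1, 1e-3

rng = np.random.default_rng(0)

class LinearNet:
    """Stand-in for the actor/critic neural networks: one linear layer."""
    def __init__(self, in_dim, out_dim):
        self.W = rng.normal(scale=0.1, size=(out_dim, in_dim))
    def __call__(self, x):
        return self.W @ x
    def copy_from(self, other, zeta=1.0):
        # Soft update (step 14): theta' <- zeta*theta + (1 - zeta)*theta'
        self.W = zeta * other.W + (1.0 - zeta) * self.W

# Step 2: critic Q, actor mu, matching target networks, replay memory B_m.
actor, critic = LinearNet(STATE_DIM, ACTION_DIM), LinearNet(STATE_DIM + ACTION_DIM, 1)
target_actor, target_critic = LinearNet(STATE_DIM, ACTION_DIM), LinearNet(STATE_DIM + ACTION_DIM, 1)
target_actor.copy_from(actor)       # theta_mu' <- theta_mu
target_critic.copy_from(critic)     # theta_Q'  <- theta_Q
replay = deque(maxlen=10_000)

def select_action(s):
    # Step 8: a = mu(s | theta_mu) + exploration noise.
    return actor(s) + rng.normal(scale=NOISE_STD, size=ACTION_DIM)

def step_env(s, a):
    # Placeholder environment: synthetic reward and next state.
    return -float(np.sum(a ** 2)), rng.normal(size=STATE_DIM)

def train_step(batch_size=32):
    # Step 11: sample a random mini-batch of transitions.
    for s, a, r, s2 in random.sample(replay, batch_size):
        # Target value eps_z = r + gamma * Q'(s', mu'(s')).
        a2 = target_actor(s2)
        eps = r + GAMMA * target_critic(np.concatenate([s2, a2]))[0]
        # Step 12: SGD on the critic loss (Q(s, a) - eps)^2.
        q_in = np.concatenate([s, a])
        err = critic(q_in)[0] - eps
        critic.W -= LR * 2.0 * err * q_in[None, :]
        # Step 13: policy gradient grad_a Q * grad_theta mu; for a linear
        # actor mu(s) = W s this is the outer product of grad_a Q and s.
        grad_a = critic.W[0, STATE_DIM:]
        actor.W += LR * grad_a[:, None] * s[None, :]
    # Step 14: soft-update both target networks.
    target_actor.copy_from(actor, ZETA)
    target_critic.copy_from(critic, ZETA)

# Minimal run of the inner loop (steps 6-15) for one GN.
s = rng.normal(size=STATE_DIM)
for n in range(200):
    a = select_action(s)
    r, s2 = step_env(s, a)
    replay.append((s, a, r, s2))    # step 10
    if len(replay) >= 32:
        train_step()
    s = s2
```

Because the episode/epoch structure and the multi-GN loop are independent of the update rules, the same `train_step` would simply run once per GN per decision epoch in the full algorithm.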