1: Set the mentor learning rate ηt, the mentee learning rate ηs, and the number of clients N
2: Set the hyperparameters Tstart and Tend
3: for each client i (in parallel) do
4:     Initialize the mentor parameters Θt and the mentee parameters Θs
5:     repeat
|
6:         gᵗi, gi ← LocalGradients(i)
7:         Θt ← Θt − ηt · gᵗi
8:         Factorize gi by SVD: gi ≈ Ui Σi Viᵀ
9:         Client i encrypts Ui, Σi, Vi
10:        Client i uploads the encrypted Ui, Σi, Vi to the server
11:        The server decrypts Ui, Σi, Vi
12:        The server reconstructs gi ← Ui Σi Viᵀ
13:        Global gradient g ← 0
14:        for each client i (in parallel) do
15:            g ← g + gi
16:        end for
17:        Factorize g by SVD: g ≈ U Σ Vᵀ
18:        The server encrypts U, Σ, V
19:        The server distributes the encrypted U, Σ, V to the clients
20:        Clients decrypt U, Σ, V
21:        Clients reconstruct g ← U Σ Vᵀ
22:        Θs ← Θs − ηs · g / N
23:     until the local models converge
24: end for
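The SVD steps (lines 8 and 17) factorize a gradient matrix so that only the factors U, Σ, V need to be encrypted and exchanged; the receiver rebuilds the gradient by multiplying them back. A minimal NumPy sketch of one aggregation round, with encryption omitted; the rank parameter `k` and the client count of 3 are illustrative assumptions, not values from the algorithm above:

```python
import numpy as np

def svd_compress(g, k):
    # Factorize g ~= U @ diag(s) @ Vt, keeping the top-k singular values.
    U, s, Vt = np.linalg.svd(g, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def svd_reconstruct(U, s, Vt):
    # Rebuild the (approximate) gradient from its factors.
    return U @ np.diag(s) @ Vt

# One round: each client factorizes its local gradient, the server
# reconstructs and sums them (lines 12-16).
rng = np.random.default_rng(0)
client_grads = [rng.normal(size=(8, 6)) for _ in range(3)]  # N = 3 clients

g = np.zeros((8, 6))
for gi in client_grads:
    factors = svd_compress(gi, k=6)   # full rank here, so the round trip is lossless
    g += svd_reconstruct(*factors)

# Mentee update (line 22): Θs <- Θs - ηs * g / N
eta_s, N = 0.1, len(client_grads)
theta_s = np.zeros((8, 6))
theta_s -= eta_s * g / N
```

Choosing `k` smaller than `min(m, n)` trades reconstruction accuracy for a smaller encrypted payload, since only `k` columns of U, `k` singular values, and `k` rows of Vᵀ are transmitted.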
|
LocalGradients(i):
25:     Compute the mentor and mentee task losses
26:     Compute the distillation losses between the mentor and the mentee
27:     Form the total mentor loss from its task and distillation losses
28:     Form the total mentee loss from its task and distillation losses
29:     Compute the local mentor gradient gᵗi from the total mentor loss
30:     Compute the local mentee gradient gi from the total mentee loss
31:     return gᵗi, gi
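The exact loss symbols in lines 25–28 did not survive extraction, but the structure is a standard mentor–mentee (knowledge-distillation) setup: each model has its own task loss plus a distillation term coupling the two. A minimal single-example sketch, assuming cross-entropy task losses, a KL distillation term, and an illustrative weight `lam`; none of these specific choices are confirmed by the listing above:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, label):
    # Task loss for one example (line 25).
    return -np.log(softmax(logits)[label])

def kl_distill(mentor_logits, mentee_logits):
    # Distillation loss pulling the mentee toward the mentor (line 26).
    p_t, p_s = softmax(mentor_logits), softmax(mentee_logits)
    return np.sum(p_t * (np.log(p_t) - np.log(p_s)))

# Illustrative logits for one 4-class example with true label 2.
mentor_logits = np.array([0.5, 0.1, 2.0, -0.3])
mentee_logits = np.array([0.2, 0.4, 1.1, 0.0])
label, lam = 2, 0.5   # lam is an assumed distillation weight

# Total losses (lines 27-28): task loss plus weighted distillation term.
mentor_loss = cross_entropy(mentor_logits, label)
mentee_loss = cross_entropy(mentee_logits, label) \
    + lam * kl_distill(mentor_logits, mentee_logits)

# For softmax + cross-entropy the gradient w.r.t. the logits is p - onehot;
# lines 29-30 backpropagate these through each network's parameters.
onehot = np.eye(4)[label]
mentor_grad = softmax(mentor_logits) - onehot
mentee_grad = (softmax(mentee_logits) - onehot) \
    + lam * (softmax(mentee_logits) - softmax(mentor_logits))
```

Only `mentee_grad` (gi) enters the encrypted federated exchange; `mentor_grad` (gᵗi) is consumed locally by the mentor update.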
|