Algorithm A1. PDC-KD Training Procedure

1: Input: Dataset D, teacher model T, student model S
2: Initialize student parameters; set Cholesky factor
3: Load precomputed Teacher Fisher matrix and projection operator
4: for epoch = 1 to 200 do
5:   if epoch is in the warm-up phase then
6:     Apply linear warm-up
7:   end if
8:   for each batch in D do
9:     Compute teacher output
10:     Compute student output
11:     Apply SG filtering
12:     Compute equivalent inertia tensor
13:     Compute
14:     Aggregate total loss (Equation (10))
15:     Perform backpropagation with gradient clipping to update the student parameters and L
16:   end for
17: end for
18: Output: Trained student model S
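
The two generic pieces of the outer loop, the linear warm-up (steps 5 to 7) and the clipped update (step 15), can be sketched in plain Python. This is only an illustrative sketch under stated assumptions, not the PDC-KD implementation: the warm-up length `warmup_epochs`, the base learning rate, the clipping threshold `max_norm`, and the choice to apply the warm-up to the learning rate are all hypothetical, and a toy quadratic loss stands in for the total loss of Equation (10). The Fisher matrix, the projection operator, SG filtering, the inertia tensor, and the Cholesky factor are deliberately left out.

```python
import math


def warmup_scale(epoch, warmup_epochs):
    """Linear warm-up factor in (0, 1]; epochs are 1-indexed.

    `warmup_epochs` is a hypothetical parameter, not a value from the paper.
    """
    if epoch < warmup_epochs:
        return epoch / warmup_epochs
    return 1.0


def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so that their global L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)
    return [g * max_norm / total for g in grads]


def train(num_epochs=200, warmup_epochs=5, base_lr=0.1, max_norm=1.0):
    """Toy loop: fit `theta` to a fixed target with warm-up and clipping."""
    theta = [2.0, -3.0]    # toy stand-in for the student parameters
    target = [0.5, 0.25]   # toy stand-in for the teacher output
    for epoch in range(1, num_epochs + 1):
        # Assumption: warm-up scales the learning rate.
        lr = base_lr * warmup_scale(epoch, warmup_epochs)
        # Gradient of the toy loss 0.5 * ||theta - target||^2.
        grads = [t - y for t, y in zip(theta, target)]
        grads = clip_by_global_norm(grads, max_norm)
        theta = [t - lr * g for t, g in zip(theta, grads)]
    return theta
```

In this toy setting, clipping bounds the step length while the parameters are far from the target, and the warm-up shrinks the first few steps further; once the gradient norm falls below `max_norm`, the update reduces to ordinary gradient descent.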