1: Input: training dataset X, training labels Y, feed-forward neural network fW(·), number of epochs B, hierarchy multiplier M, path multiplier ϵ, learning rate α
2: Initialize and train the feed-forward network on the loss L(X, Y; θ, W)
3: Initialize the penalty, λ = ϵ, and the number of active features, k = d
4: while k > 0 do
5:     Update λ ← (1 + ϵ)λ
6:     for b ∈ {1, ..., B} do
7:         Compute the gradient of the loss w.r.t. (θ, W) using back-propagation
8:         Update θ ← θ − α∇θL and W ← W − α∇W L
9:         Update (θ, W(1)) ← Hier-Prox(θ, W(1), λ, M)
10:    end for
11:    Update k to be the number of non-zero coordinates of θ
12: end while

where Hier-Prox is defined in Alg. 2
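For concreteness, here is a minimal PyTorch sketch of this training loop. The architecture (a linear skip connection θ plus a one-hidden-layer network whose first layer plays the role of W(1)), the names LassoNetSketch, hier_prox, and train_path, and all hyperparameter values are illustrative assumptions, not taken from this excerpt. In particular, since Alg. 2 is not reproduced here, the hier_prox helper follows my reading of the published Hier-Prox operator and should be checked against Alg. 2.

```python
import torch


def soft_threshold(x, lam):
    # S_lam(x) = sign(x) * max(|x| - lam, 0), applied elementwise
    return torch.sign(x) * torch.clamp(x.abs() - lam, min=0.0)


def hier_prox(theta, W, lam, M):
    """Sketch of Hier-Prox (my reading of Alg. 2, not shown in this excerpt).

    Couples each theta_j with the first-layer column W[:, j] so that the
    hierarchy constraint |W_kj| <= M * |theta_j| holds after the update.
    """
    K, d = W.shape
    theta_new, W_new = torch.empty_like(theta), torch.empty_like(W)
    for j in range(d):
        u, _ = torch.sort(W[:, j].abs(), descending=True)          # u_1 >= ... >= u_K
        prefix = torch.cat([u.new_zeros(1), torch.cumsum(u, 0)])   # sums for m = 0..K
        m = torch.arange(K + 1, dtype=u.dtype)
        # candidate magnitudes: w_m = M / (1 + m M^2) * S_lam(|theta_j| + M * prefix_m)
        w = M / (1 + m * M**2) * soft_threshold(theta[j].abs() + M * prefix, lam)
        # keep the m consistent with the sorted column: u_{m+1} <= w_m <= u_m,
        # with the conventions u_0 = +inf and u_{K+1} = 0
        upper = torch.cat([u.new_full((1,), float("inf")), u])
        lower = torch.cat([u, u.new_zeros(1)])
        valid = (w >= lower) & (w <= upper)
        m_star = int(valid.nonzero()[0]) if valid.any() else K     # guard round-off
        theta_new[j] = torch.sign(theta[j]) * w[m_star] / M
        W_new[:, j] = torch.sign(W[:, j]) * torch.minimum(W[:, j].abs(), w[m_star])
    return theta_new, W_new


class LassoNetSketch(torch.nn.Module):
    # Residual architecture the algorithm assumes: linear skip connection
    # theta plus a one-hidden-layer network whose first layer is W1.
    def __init__(self, d, hidden):
        super().__init__()
        self.theta = torch.nn.Parameter(torch.zeros(d))
        self.W1 = torch.nn.Parameter(0.1 * torch.randn(hidden, d))
        self.b1 = torch.nn.Parameter(torch.zeros(hidden))
        self.out = torch.nn.Linear(hidden, 1)

    def forward(self, x):
        h = torch.relu(x @ self.W1.T + self.b1)
        return x @ self.theta + self.out(h).squeeze(-1)


def train_path(model, X, Y, B, M, eps, alpha):
    """Steps 3-12: grow lambda until no features remain active. Step 2's
    dense pre-training is assumed already done (the same loop, no prox)."""
    loss_fn = torch.nn.MSELoss()
    lam, path = eps, []                            # step 3 (k = d initially)
    k = int(model.theta.numel())
    while k > 0:                                   # step 4
        lam = (1 + eps) * lam                      # step 5
        for _ in range(B):                         # steps 6-10: B epochs
            model.zero_grad()
            loss_fn(model(X), Y).backward()        # step 7: back-propagation
            with torch.no_grad():
                for p in model.parameters():       # step 8: gradient step
                    p -= alpha * p.grad
                theta_new, W_new = hier_prox(model.theta, model.W1, lam, M)
                model.theta.copy_(theta_new)       # step 9: proximal update
                model.W1.copy_(W_new)
        k = int((model.theta != 0).sum())          # step 11: active features
        path.append((lam, k))
    return path


if __name__ == "__main__":
    # Hypothetical usage: a toy regression where only features 0 and 1 matter.
    X = torch.randn(256, 10)
    Y = X[:, 0] - 2 * X[:, 1] + 0.1 * torch.randn(256)
    model = LassoNetSketch(d=10, hidden=16)
    print(train_path(model, X, Y, B=5, M=10.0, eps=0.05, alpha=1e-2))
```

One quick sanity check on the proximal step: when the first-layer column W[:, j] is all zeros, hier_prox reduces to plain soft-thresholding of θ_j, as expected for a lasso-style penalty with no network contribution from feature j.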