1: Input: training dataset X, training labels Y, feed-forward neural network fW(·), number of epochs B, hierarchy multiplier M, path multiplier ϵ, learning rate α
2: Initialize and train the feed-forward network on the loss L(X, Y; θ, W)
3: Initialize the penalty, λ = ϵ, and the number of active features, k = d
4: while k > 0 do
5:     Update λ ← (1 + ϵ)λ
6:     for b ∈ {1, ..., B} do
7:         Compute the gradient of the loss w.r.t. (θ, W) using back-propagation
8:         Update θ ← θ − α∇θL and W ← W − α∇W L
9:         Update (θ, W(1)) ← Hier-Prox(θ, W(1), λ, M)
10:    end for
11:    Update k to be the number of non-zero coordinates of θ
12: end while

where Hier-Prox is defined in Alg. 2
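For concreteness, here is a minimal PyTorch sketch of this training loop. The architecture (a linear skip connection θ plus a one-hidden-layer network whose first layer plays the role of W(1)), the names LassoNetSketch, hier_prox, and train_path, and all hyperparameter values are illustrative assumptions, not taken from this excerpt. In particular, since Alg. 2 is not reproduced here, the hier_prox helper follows my reading of the published Hier-Prox operator and should be checked against Alg. 2.

```python
import torch


def soft_threshold(x, lam):
    # S_lam(x) = sign(x) * max(|x| - lam, 0), applied elementwise
    return torch.sign(x) * torch.clamp(x.abs() - lam, min=0.0)


def hier_prox(theta, W, lam, M):
    """Sketch of Hier-Prox (my reading of Alg. 2, not shown in this excerpt).

    Couples each theta_j with the first-layer column W[:, j] so that the
    hierarchy constraint |W_kj| <= M * |theta_j| holds after the update.
    """
    K, d = W.shape
    theta_new, W_new = torch.empty_like(theta), torch.empty_like(W)
    for j in range(d):
        u, _ = torch.sort(W[:, j].abs(), descending=True)          # u_1 >= ... >= u_K
        prefix = torch.cat([u.new_zeros(1), torch.cumsum(u, 0)])   # sums for m = 0..K
        m = torch.arange(K + 1, dtype=u.dtype)
        # candidate magnitudes: w_m = M / (1 + m M^2) * S_lam(|theta_j| + M * prefix_m)
        w = M / (1 + m * M**2) * soft_threshold(theta[j].abs() + M * prefix, lam)
        # keep the m consistent with the sorted column: u_{m+1} <= w_m <= u_m,
        # with the conventions u_0 = +inf and u_{K+1} = 0
        upper = torch.cat([u.new_full((1,), float("inf")), u])
        lower = torch.cat([u, u.new_zeros(1)])
        valid = (w >= lower) & (w <= upper)
        m_star = int(valid.nonzero()[0]) if valid.any() else K     # guard round-off
        theta_new[j] = torch.sign(theta[j]) * w[m_star] / M
        W_new[:, j] = torch.sign(W[:, j]) * torch.minimum(W[:, j].abs(), w[m_star])
    return theta_new, W_new


class LassoNetSketch(torch.nn.Module):
    # Residual architecture the algorithm assumes: linear skip connection
    # theta plus a one-hidden-layer network whose first layer is W1.
    def __init__(self, d, hidden):
        super().__init__()
        self.theta = torch.nn.Parameter(torch.zeros(d))
        self.W1 = torch.nn.Parameter(0.1 * torch.randn(hidden, d))
        self.b1 = torch.nn.Parameter(torch.zeros(hidden))
        self.out = torch.nn.Linear(hidden, 1)

    def forward(self, x):
        h = torch.relu(x @ self.W1.T + self.b1)
        return x @ self.theta + self.out(h).squeeze(-1)


def train_path(model, X, Y, B, M, eps, alpha):
    """Steps 3-12: grow lambda until no features remain active. Step 2's
    dense pre-training is assumed already done (the same loop, no prox)."""
    loss_fn = torch.nn.MSELoss()
    lam, path = eps, []                            # step 3 (k = d initially)
    k = int(model.theta.numel())
    while k > 0:                                   # step 4
        lam = (1 + eps) * lam                      # step 5
        for _ in range(B):                         # steps 6-10: B epochs
            model.zero_grad()
            loss_fn(model(X), Y).backward()        # step 7: back-propagation
            with torch.no_grad():
                for p in model.parameters():       # step 8: gradient step
                    p -= alpha * p.grad
                theta_new, W_new = hier_prox(model.theta, model.W1, lam, M)
                model.theta.copy_(theta_new)       # step 9: proximal update
                model.W1.copy_(W_new)
        k = int((model.theta != 0).sum())          # step 11: active features
        path.append((lam, k))
    return path


if __name__ == "__main__":
    # Hypothetical usage: a toy regression where only features 0 and 1 matter.
    X = torch.randn(256, 10)
    Y = X[:, 0] - 2 * X[:, 1] + 0.1 * torch.randn(256)
    model = LassoNetSketch(d=10, hidden=16)
    print(train_path(model, X, Y, B=5, M=10.0, eps=0.05, alpha=1e-2))
```

One quick sanity check on the proximal step: when the first-layer column W[:, j] is all zeros, hier_prox reduces to plain soft-thresholding of θ_j, as expected for a lasso-style penalty with no network contribution from feature j.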