Author manuscript; available in PMC: 2024 Mar 28.
Published in final edited form as: J Am Stat Assoc. 2022 Mar 27;118(544):2315–2328. doi: 10.1080/01621459.2022.2044824

Understanding Implicit Regularization in Over-Parameterized Single Index Model

Jianqing Fan 1, Zhuoran Yang 2, Mengxin Yu 3
PMCID: PMC10977662  NIHMSID: NIHMS1782584  PMID: 38550788

Abstract

In this paper, we leverage over-parameterization to design regularization-free algorithms for the high-dimensional single index model and provide theoretical guarantees for the induced implicit regularization phenomenon. Specifically, we study both vector and matrix single index models where the link function is nonlinear and unknown, the signal parameter is either a sparse vector or a low-rank symmetric matrix, and the response variable can be heavy-tailed. To gain a better understanding of the role played by implicit regularization without excess technicality, we assume that the distribution of the covariates is known a priori. For both the vector and matrix settings, we construct an over-parameterized least-squares loss function by employing the score function transform and a robust truncation step designed specifically for heavy-tailed data. We propose to estimate the true parameter by applying regularization-free gradient descent to the loss function. When the initialization is close to the origin and the stepsize is sufficiently small, we prove that the obtained solution achieves minimax optimal statistical rates of convergence in both the vector and matrix cases. In addition, our experimental results support our theoretical findings and also demonstrate that our methods empirically outperform classical methods with explicit regularization in terms of both $\ell_2$-statistical rate and variable selection consistency.

Keywords: Implicit Regularization, high-dimensional models, over-parameterization, single-index models

1. Introduction

With the astonishing empirical success in various application domains such as computer vision [Voulodimos et al., 2018], natural language processing [Otter et al., 2020, Torfi et al., 2020], and reinforcement learning [Arulkumaran et al., 2017, Li, 2017], deep learning [LeCun et al., 2015, Goodfellow et al., 2016, Fan et al., 2021a] has become one of the most prevalent classes of machine learning methods. When applying deep learning to supervised learning tasks such as regression and classification, the regression function or classifier is represented by a deep neural network, which is learned by minimizing a loss function of the network weights. Here the loss function is defined as the empirical risk function computed based on the training data and the optimization problem is usually solved by gradient-based optimization methods. Due to the nonlinearity of the activation function and the multi-layer functional composition, the landscape of the loss function is highly nonconvex, with many saddle points and local minima [Dauphin et al., 2014, Swirszcz et al., 2016, Yun et al., 2019]. Moreover, oftentimes the neural network is over-parameterized in the sense that the total number of network weights exceeds the number of training data, making the regression or classification problem ill-posed from a statistical perspective. Surprisingly, however, it is often observed empirically that simple algorithms such as (stochastic) gradient descent tend to find the global minimum of the loss function despite non-convexity. Moreover, the obtained solution also generalizes well to unseen data with small test error [Neyshabur et al., 2015, Zhang et al., 2017]. These mysterious observations cannot be fully explained by the classical theory of nonconvex optimization and generalization bounds via uniform convergence.

To understand such an intriguing phenomenon, Neyshabur et al. [2015], Zhang et al. [2017] show empirically that the generalization stems from an “implicit regularization” of the optimization algorithm. Specifically, they observe that, in over-parametrized statistical models, although the optimization problems contain bad local minima with large generalization error, the chosen optimization algorithm, typically a variant of gradient descent, guards the iterates from bad local minima and prefers solutions that generalize well. Thus, without adding any regularization term in the optimization objective, the implicit preference of the optimization algorithm itself plays the role of regularization. Implicit regularization has been shown to be indispensable in training deep learning models [Neyshabur et al., 2015, 2017, Zhang et al., 2017, Keskar et al., 2017, Poggio et al., 2017, Wilson et al., 2017].

With a properly designed algorithm, Gunasekar et al. [2017] and Li et al. [2018] provide empirical evidence and theoretical guarantees for the implicit regularization of gradient descent for least-squares regression with a two-layer linear neural network, i.e., low-rank matrix sensing. They show that gradient descent is biased towards the minimum nuclear norm solution when the initialization is close to the origin, stepsizes are sufficiently small, and no explicit regularization is imposed. More specifically, when the true parameter is a rank-$r$ positive-semidefinite matrix in $\mathbb{R}^{d\times d}$, they rewrite the parameter as $UU^\top$, where $U \in \mathbb{R}^{d\times d}$, and propose to estimate the true parameter by updating $U$ via gradient descent. Li et al. [2018] proves that, with $\tilde{O}(r^2 d)$ i.i.d. observations of the model, gradient descent provably recovers the true parameter with accuracy, where $\tilde{O}(\cdot)$ hides absolute constants and polylogarithmic terms. Thus, in over-parametrized matrix sensing problems, the implicit regularization of gradient descent can be viewed as equivalent to adding a nuclear norm penalty explicitly. See also Arora et al. [2019] for a related topic on deep linear networks.

Moreover, Zhao et al. [2019], Vaškevičius et al. [2019] recently design a novel regularization-free algorithm and study the implicit regularization of gradient descent for high-dimensional linear regression with a sparse signal parameter, which is a vector in $\mathbb{R}^p$ with $s$ nonzero entries. They propose to re-parametrize the parameter using two vectors in $\mathbb{R}^p$ via the Hadamard product and estimate the true parameter via un-regularized gradient descent with proper initialization, stepsizes, and number of iterations. They prove independently that, with $n = O(s^2 \log p)$ i.i.d. observations, gradient descent yields an estimator of the true parameter with the optimal statistical accuracy. More interestingly, when the nonzero entries of the true parameter all have sufficiently large magnitude, the proposed estimator attains the oracle $O(\sqrt{s \log s/n})$ rate that is independent of the ambient dimension $p$. Hence, for sparse linear regression, the implicit regularization of gradient descent has the same effect as the folded concave penalties [Fan et al., 2014] such as smoothly clipped absolute deviation (SCAD) [Fan and Li, 2001] and minimax concave penalty (MCP) [Zhang et al., 2010].

The aforementioned works all design algorithms and establish theoretical results for linear statistical models with light-tailed noise, which is somewhat restrictive since linear models with sub-Gaussian noise comprise only a small proportion of the models of interest in statistics. For example, in the field of finance, linear models bring only limited contributions and the datasets are always corrupted by heavy-tailed noise. Thus, one question is left open:

Can we leverage over-parameterization and implicit regularization to establish statistically accurate estimation procedures for a more general class of high-dimensional statistical models with possibly heavy-tailed data?

In this work, we focus on the single index model, where the response variable Y and the covariate X satisfy Y = f(〈X, β*〉) + ϵ, with β* being the true parameter, ϵ being the random noise, and $f: \mathbb{R} \to \mathbb{R}$ being an unknown (nonlinear) link function. Here β* is either an $s$-sparse vector in $\mathbb{R}^p$ or a rank-$r$ matrix in $\mathbb{R}^{d\times d}$. Since f is unknown, the norm of β* is not identifiable. Thus, for the vector and matrix cases respectively, we further assume that the $\ell_2$- or Frobenius norm of β* is equal to one. Our goal is to recover the true parameter β* given n i.i.d. observations of the model. Such a model can be viewed as the misspecified version of the compressed sensing [Donoho, 2006, Candés, 2008] and phase retrieval [Shechtman et al., 2015, Candés et al., 2015] models, which correspond to the identity and quadratic link functions, respectively.

In a single index model, due to the unknown link function, it is infeasible to directly estimate β* via nonlinear least-squares. Moreover, jointly minimizing the least-squares loss function with respect to β* and f is computationally intractable. To overcome these challenges, a recent line of research proposes to estimate β* by the method of moments when the distribution of X is known. Working in this setting helps us provide a deeper understanding of the implicit regularization induced by over-parameterization in nonlinear models without excessive technicality, and eliminates other complicating factors that would obscure the insights. Specifically, when X is a standard Gaussian random vector, Stein’s identity [Stein et al., 1972] implies that the expectation of Y · X is proportional to β*. Thus, despite the nonlinear link function, β* can be accurately estimated by neglecting f and fitting a regularized least-squares regression. In particular, when β* is a sparse vector, Plan and Vershynin [2016], Plan et al. [2017] prove that the Lasso estimator achieves the optimal statistical rate of convergence. Subsequently, such an approach has been extended to the cases beyond Gaussian covariates. In particular, Goldstein et al. [2018], Wei [2018], Goldstein and Wei [2019] allow the covariates to follow an elliptically symmetric distribution that can be heavy-tailed. In addition, utilizing a generalized version of Stein’s identity [Stein et al., 2004], Yang et al. [2017] extends the Lasso approach to the setting where the covariate X has a known density $p_0$. Specifically, when $p_0$ is known, we can define the score function $S_{p_0}(\cdot)$ as $S_{p_0}(\cdot) = -\nabla \log p_0(\cdot)$, which enjoys the property that $E[Y \cdot S_{p_0}(X)]$ identifies the direction of β*. Thus, the true parameter can be estimated via M-estimation with $S_{p_0}(X)$ serving as the covariate.

To answer the question given above, in this work, we leverage over-parameterization to design regularization-free algorithms for the single index model and provide theoretical guarantees for the induced implicit regularization phenomenon. To be more specific, we first adopt the quadratic loss function in Yang et al. [2017] and rewrite the parameter of interest by over-parameterization. When β* is a sparse vector in $\mathbb{R}^p$, we adopt a Hadamard product parameterization [Hoff, 2017, Zhao et al., 2019, Vaškevičius et al., 2019] and write β* as $w \odot w - v \odot v$, where both $w$ and $v$ are vectors in $\mathbb{R}^p$. We propose to minimize the loss function as a function of the new parameters via gradient descent, where both w and v are initialized near an all-zero vector and the stepsizes are fixed to be a sufficiently small constant η > 0. Furthermore, when β* is a low-rank matrix, we similarly represent β* as $WW^\top - VV^\top$ and propose to recover β* by applying the gradient descent algorithm to the quadratic loss function under the new parameterization.

Furthermore, the analysis of our algorithm faces the following two challenges. First, due to over-parameterization, there exist exponentially many stationary points of the population loss function that are far from the true parameter. Thus, it seems that the gradient descent algorithm would be likely to return a stationary point that incurs a large error. Second, both the response Y and the score $S_{p_0}(X)$ can be heavy-tailed random variables. Thus, the gradient of the empirical loss function can deviate significantly from its expectation, which poses an additional challenge to establishing the statistical error of the proposed estimator.

To overcome these difficulties, in our algorithm, instead of estimating $E[Y \cdot S_{p_0}(X)]$ by its empirical counterpart, we construct robust estimators via proper truncation techniques, which have been widely applied in high-dimensional M-estimation problems with heavy-tailed data [Fan et al., 2021c, Zhu, 2017, Wei and Minsker, 2017, Minsker, 2018, Fan et al., 2021b, Ke et al., 2019, Minsker and Wei, 2020]. These robust estimators are then employed to compute the update directions of the gradient descent algorithm. Moreover, despite the seemingly perilous loss surface, we prove that, when initialized near the origin and run with sufficiently small stepsizes, the gradient descent algorithm guards the iterates from bad stationary points. More importantly, when the number of iterations is properly chosen, the obtained estimator provably enjoys (near-)optimal $O(\sqrt{s \log p/n})$ and $O(\sqrt{rd \log d/n})$ $\ell_2$-statistical rates under the sparse and low-rank settings, respectively. Moreover, for sparse β*, when the magnitude of the nonzero entries is sufficiently large, we prove that our estimator enjoys an oracle $O(\sqrt{s \log n/n})$ $\ell_2$-statistical rate, which is independent of the dimensionality p. Our proof is based on a joint statistical and computational analysis of the gradient descent dynamics. Specifically, we decompose the iterates into a signal part and a noise part, where the signal part shares the same sparse or low-rank structure as the true signal and the noise part is orthogonal to the true signal. We prove that the signal part converges to the true parameter efficiently whereas the noise part accumulates at a rather slow rate and thus remains small for a sufficiently large number of iterations. Such a dichotomy between the signal and noise parts characterizes the implicit regularization of the gradient descent algorithm and enables us to establish the statistical error of the final estimator.

Furthermore, our method has several merits compared with classical regularized methods. From the theoretical perspective, our strengths are two-fold. First, as we mentioned in the last paragraph, under mild conditions, our estimator enjoys the oracle statistical rate whereas the most commonly used $\ell_1$-regularized method always results in large bias. In this case, our method is equivalent to adding folded-concave regularizers (e.g. SCAD, MCP) to the loss function. Second, for all estimators inside the wide optimal time interval, our range of choosing the truncating parameter to achieve variable selection consistency (rank consistency) is much wider than that of classical regularized methods. Thus, our method is more robust than all regularized methods in terms of selecting the truncating parameter. Meanwhile, from the aspect of applications, our strengths are three-fold. First, in terms of $\ell_2$-statistical rate, numerical studies show that our method generalizes even better than adding folded-concave penalties. Second, from the aspect of variable selection, experimental results also show that the robustness of our method helps reduce false positive rates greatly. Last but not least, as we only need to run gradient descent and the gradient information can be efficiently transferred among different machines, our method is easier to parallelize and to generalize to large-scale problems. Thus, our method can be applied to modern machine learning applications such as federated learning.

To summarize, our contribution is several-fold. First, for sparse and low-rank single index models where the random noise is possibly heavy-tailed, we employ a quadratic loss function based on a robust estimator of $E[Y \cdot S_{p_0}(X)]$ and propose to estimate β* by combining over-parameterization and regularization-free gradient descent. Second, we prove that, when the initialization, stepsizes, and stopping time of the gradient descent algorithm are properly chosen, the proposed estimator achieves optimal statistical rates of convergence up to logarithmic terms under both the sparse and low-rank settings. This captures the implicit regularization phenomenon induced by our algorithm. Third, in order to corroborate our theories, we conduct extensive numerical studies. The experimental results support our theoretical findings and also show that our method outperforms classical regularized methods in terms of both $\ell_2$-statistical rates and variable selection consistency.

1.1. Related Works

Our work belongs to the recent line of research on understanding the implicit regularization of gradient-based optimization methods in various statistical models. In addition, our work is also closely related to the large body of literature on single index models. Due to the space limit, we defer the discussions on related works to Appendix A in the supplement.

1.2. Notation

In this subsection, we introduce our notation. Throughout this work, we use [n] to denote the set {1, 2, …, n}. For a subset $S$ of $[n]$ and a vector $u$, we use $u_S$ to denote the vector whose $i$-th entry is $u_i$ if $i \in S$ and 0 otherwise. For any vector $u$ and $q \ge 0$, we use $\|u\|_q$ to represent the vector $\ell_q$ norm. In addition, the inner product $\langle u, v\rangle$ between any pair of vectors $u, v$ is defined as the Euclidean inner product $u^\top v$. Moreover, we define $u \odot v$ as the Hadamard product of vectors $u, v$. For any given matrix $X \in \mathbb{R}^{d_1\times d_2}$, we use $\|X\|_{op}$, $\|X\|_F$ and $\|X\|_*$ to represent the operator norm, Frobenius norm and nuclear norm of matrix $X$ respectively. In addition, for any two matrices $X, Y \in \mathbb{R}^{d_1\times d_2}$, we define their inner product $\langle X, Y\rangle$ as $\langle X, Y\rangle = \mathrm{tr}(X^\top Y)$. Moreover, if we write $X \succeq 0$ or $X \preceq 0$, then the matrix $X$ is meant to be positive semidefinite or negative semidefinite. We let $\{a_n, b_n\}_{n\ge 1}$ be any two positive sequences. We write $a_n \lesssim b_n$ if there exists a universal constant $C$ such that $a_n \le C \cdot b_n$ and we write $a_n \ll b_n$ if $a_n/b_n \to 0$. In addition, we write $a_n \asymp b_n$ if we have $a_n \lesssim b_n$ and $b_n \lesssim a_n$, and the notations $a_n = O(b_n)$ and $a_n = o(b_n)$ share the same meaning with $a_n \lesssim b_n$ and $a_n \ll b_n$. Moreover, $a_n = \tilde{O}(b_n)$ means $a_n \le C b_n$ up to some logarithmic terms. Finally, we use $a_n = \Omega(b_n)$ if there exists a universal constant $c > 0$ such that $a_n/b_n \ge c$ and we use $a_n = \Theta(b_n)$ if $c \le a_n/b_n \le C$ where $c, C > 0$ are universal constants.

1.3. Roadmap

The organization of our paper is as follows. We introduce the background knowledge in §2. In §3 and §4 we investigate the implicit regularization effect of gradient descent in over-parameterized SIM under the vector and matrix settings, respectively. Extensive simulation studies are presented in §B to corroborate our theory.

2. Preliminaries

In this section, we introduce the phenomenon of implicit regularization via over-parameterization, high dimensional single index model, and generalized Stein’s identity [Stein et al., 2004].

2.1. Related Works on Implicit Regularization

Both Gunasekar et al. [2017] and Li et al. [2018] have studied least squares objectives over positive semidefinite matrices $\beta \in \mathbb{R}^{d\times d}$ of the following form

$$\min_{\beta \succeq 0} F(\beta) = \frac{1}{n}\sum_{i=1}^n \big(y_i - \langle X_i, \beta\rangle\big)^2, \tag{1}$$

where the labels $\{y_i\}_{i=1}^n$ are generated from linear measurements $y_i = \langle X_i, \beta^*\rangle$, $i \in [n]$, with $\beta^* \in \mathbb{R}^{d\times d}$ being positive semidefinite and low rank. Here β* is of rank r where r is much smaller than d.

Instead of working on parameter β directly, they write β as $\beta = UU^\top$ where $U \in \mathbb{R}^{d\times d}$, and study the optimization problem related to U,

$$\min_{U \in \mathbb{R}^{d\times d}} f(U) = \frac{1}{2n}\sum_{i=1}^n \big(y_i - \langle X_i, UU^\top\rangle\big)^2. \tag{2}$$

The least-squares problem in (2) is over-parameterized because here β is parameterized by U, which has $d^2$ degrees of freedom, whereas β*, being a rank-r matrix, has O(rd) degrees of freedom. Gunasekar et al. [2017] proves that when $\{X_i\}_{i=1}^n$ are commutative and U is properly initialized, if the gradient flow of (2) converges to a solution $\hat{U}$ such that $\hat\beta = \hat{U}\hat{U}^\top$ is a globally optimal solution of (1), then $\hat\beta$ has the minimum nuclear norm over all global optima. Namely,

$$\hat\beta \in \operatorname*{argmin}_{\beta \succeq 0} \|\beta\|_*, \quad \text{subject to } \langle X_i, \beta\rangle = y_i, \quad \forall i \in [n].$$

Subsequently, Li et al. [2018] assumes $\{X_i\}_{i=1}^n$ satisfy the restricted isometry property (RIP) condition [Candés, 2008] and proves that by applying gradient descent to (2) with the initialization close to zero and sufficiently small fixed stepsizes, the near exact recovery of β* is achieved.
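As a rough illustration of gradient descent on the factorized objective (2), the following sketch runs plain gradient descent on $U$ from a small initialization. The problem sizes, the symmetric Gaussian measurement matrices, the initialization scale `1e-3`, the stepsize, and the iteration count are all assumptions chosen for a quick demonstration; they are not the settings analyzed in Gunasekar et al. [2017] or Li et al. [2018].

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 30, 1, 240  # illustrative sizes (assumed), with n well below d * d

# Rank-r positive semidefinite ground truth with unit Frobenius norm.
B = rng.standard_normal((d, r))
beta_star = B @ B.T
beta_star /= np.linalg.norm(beta_star)

# Symmetric Gaussian measurement matrices and noiseless labels y_i = <X_i, beta*>.
G = rng.standard_normal((n, d, d))
X = (G + G.transpose(0, 2, 1)) / 2.0
y = np.einsum("nij,ij->n", X, beta_star)

def loss_and_grad(U):
    """Loss f(U) from (2) and its gradient with respect to U (X_i symmetric)."""
    resid = np.einsum("nij,ij->n", X, U @ U.T) - y
    loss = 0.5 * np.mean(resid ** 2)
    M = np.einsum("n,nij->ij", resid, X) / n
    return loss, 2.0 * M @ U

U = 1e-3 * np.eye(d)   # small initialization near the origin (assumed scale)
eta = 0.05             # small fixed stepsize (assumed)
initial_loss, _ = loss_and_grad(U)
for _ in range(500):
    _, grad = loss_and_grad(U)
    U = U - eta * grad
final_loss, _ = loss_and_grad(U)

# Relative Frobenius error of the recovered matrix U U^T.
rel_err = np.linalg.norm(U @ U.T - beta_star) / np.linalg.norm(beta_star)
print(f"loss {initial_loss:.3e} -> {final_loss:.3e}, relative error {rel_err:.3e}")
```

No penalty or rank constraint appears anywhere in the loop; any preference for a low-rank solution comes only from the small initialization and the gradient dynamics.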

Recently, Li et al. [2021] proves that gradient flow on (2) with infinitesimal initialization and general covariates tends to be equivalent to the Greedy Low-Rank Learning (GLRL) algorithm, which is a greedy rank minimization algorithm. The result in Gunasekar et al. [2017] with commutative $\{X_i\}_{i=1}^n$ serves as a special case of Li et al. [2021].

As for noisy statistical models, both Zhao et al. [2019] and Vaškevičius et al. [2019] independently study the over-parameterized high-dimensional noisy linear regression problem. Specifically, here the response variables $\{y_i\}_{i=1}^n$ are generated from a linear model

$$y_i = x_i^\top \beta^* + \epsilon_i, \quad i \in [n], \tag{3}$$

where $\beta^* \in \mathbb{R}^p$ and $\{\epsilon_i\}_{i=1}^n$ are i.i.d. sub-Gaussian random variables that are independent of the covariates $\{x_i\}_{i=1}^n$. Moreover, here β* has only $s$ nonzero entries where $s \ll p$. Instead of adding sparsity-enforcing penalties, they propose to estimate β* via gradient descent with respect to w, v on a loss function L,

$$\min_{w \in \mathbb{R}^p,\, v \in \mathbb{R}^p} L(w, v) = \frac{1}{2n}\sum_{i=1}^n \big[x_i^\top (w \odot w - v \odot v) - y_i\big]^2, \tag{4}$$

where the parameter β is over-parameterized as $\beta = w \odot w - v \odot v$. Under the restricted isometry property (RIP) condition on the covariates, these works prove that, when the hyperparameters are properly selected, gradient descent on (4) finds an estimator of β* with the optimal statistical rate of convergence.

2.2. High Dimensional Single Index Model

In this subsection, we first introduce the score functions associated with random vectors and matrices, which are utilized in our algorithms. Then we formally define the high dimensional single index model (SIM) in both the vector and matrix settings.

Definition 1. Let $x \in \mathbb{R}^p$ be a random vector with density function $p_0(x): \mathbb{R}^p \to \mathbb{R}$. The score function $S_{p_0}(x): \mathbb{R}^p \to \mathbb{R}^p$ associated with x is defined as

$$S_{p_0}(x) := -\nabla_x \log p_0(x) = -\nabla_x p_0(x)/p_0(x).$$

Here the score function $S_{p_0}(x)$ relies on the density function $p_0(x)$ of the covariate x. In order to simplify the notation, in the rest of the paper, we omit the subscript $p_0$ from $S_{p_0}$ when the underlying distribution of x is clear from the context.

Remark: If the covariate $X \in \mathbb{R}^{d\times d}$ is a random matrix whose entries are i.i.d. with a univariate density $p_0(x): \mathbb{R} \to \mathbb{R}$, we then define the score function $S(X) \in \mathbb{R}^{d\times d}$ entrywise. In other words, for any $(i, j) \in [d] \times [d]$, we obtain

$$S(X)_{i,j} := -p_0'(X_{i,j})/p_0(X_{i,j}). \tag{5}$$

Next, we introduce the first-order general Stein’s identity.

Lemma 1. (First-Order General Stein’s Identity, [Stein et al., 2004]) We assume that the covariate $x \in \mathbb{R}^p$ follows a distribution with density function $p_0(x): \mathbb{R}^p \to \mathbb{R}$ which is differentiable and satisfies the condition that $|p_0(x)|$ converges to zero as $\|x\|_2$ goes to infinity. Then for any differentiable function f(x) with $E[|f(x)S(x)|] \vee E[\|\nabla_x f(x)\|_2] < \infty$, it holds that,

$$E[f(x)S(x)] = E[\nabla_x f(x)],$$

where $S(x) = -\nabla_x p_0(x)/p_0(x)$ is the score function with respect to x defined in Definition 1.

Remark: In the case of a matrix covariate, we are able to achieve the same conclusion by simply replacing $x \in \mathbb{R}^p$ by $X \in \mathbb{R}^{d\times d}$ in Lemma 1 with the definition of matrix score function in (5).
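A one-dimensional integration-by-parts sketch may help to see where Lemma 1 comes from. It assumes that the boundary term $f(x)p_0(x)$ vanishes at $\pm\infty$, which is an assumption made here for illustration, and the last line uses the standard fact that the $N(0,1)$ density has score $S(x) = x$.

```latex
\begin{align*}
E[f(x)S(x)]
  &= -\int f(x)\,\frac{p_0'(x)}{p_0(x)}\,p_0(x)\,dx
   = -\int f(x)\,p_0'(x)\,dx \\
  &= -\big[f(x)\,p_0(x)\big]_{-\infty}^{\infty} + \int f'(x)\,p_0(x)\,dx
   = E[f'(x)], \\
x \sim N(0,1):\quad
  & p_0(x) \propto e^{-x^2/2}, \qquad S(x) = x, \qquad E[x\,f(x)] = E[f'(x)].
\end{align*}
```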

In the sequel, we introduce the single index models considered in this work. We first define sparse vector single index models as follows.

Definition 2. (Sparse Vector SIM) We assume the response Y is generated from model

$$Y = f(\langle x, \beta^*\rangle) + \epsilon, \tag{6}$$

with unknown link $f: \mathbb{R} \to \mathbb{R}$, p-dimensional covariate x as well as signal β* which is the parameter of interest. Here, we let ϵ be an exogenous random noise with mean zero. In addition, if not particularly indicated, we assume the entries of x are i.i.d. random variables with a known univariate density $p_0(x)$. As for the underlying true signal β*, it is assumed to be s-sparse with $s \ll p$. Since the length of β* can be absorbed by the unknown link f, we let $\|\beta^*\|_2 = 1$ for model identifiability.

By the definition of sparse vector SIM, we notice that many well-known models are included in this category, such as linear regression $y_i = x_i^\top\beta^* + \epsilon$, phase retrieval $y_i = (x_i^\top\beta^*)^2 + \epsilon$, as well as one-bit compressed sensing $y_i = \mathrm{sign}(x_i^\top\beta^*) + \epsilon$.

Finally, we define the low rank matrix SIM as follows.

Definition 3. (Symmetric Low Rank Matrix SIM) For the low rank matrix SIM, we assume the response Y is generated from

$$Y = f(\langle X, \beta^*\rangle) + \epsilon, \tag{7}$$

in which $\beta^* \in \mathbb{R}^{d\times d}$ is a low rank symmetric matrix with rank $r \ll d$ and the link function f is unknown. For the covariate $X \in \mathbb{R}^{d\times d}$, we assume the entries of X are i.i.d. with a known density $p_0(x)$. Besides, since $\|\beta^*\|_F$ can be absorbed in the unknown link function f, we further assume $\|\beta^*\|_F = 1$ for model identifiability. In addition, the noise term ϵ is also assumed additive and mean zero.

As we have discussed in the introduction, almost all existing literature designs algorithms and studies the corresponding implicit regularization phenomenon in linear models with sub-Gaussian data. The scope of this work is to leverage over-parameterization to design regularization-free algorithms and delineate the induced implicit regularization phenomenon for a more general class of statistical models with possibly heavy-tailed data. Specifically, in §3 and §4, we design algorithms and capture the implicit regularization induced by the gradient descent algorithm for over-parameterized vector and matrix SIMs, respectively.

3. Main Results for Over-Parameterized Vector SIM

Leveraging our conclusion from Lemma 1 as well as our definition of sparse vector SIM in Definition 2, we have

$$E[Y \cdot S(x)] = E[f(\langle x, \beta^*\rangle) \cdot S(x)] = E[f'(\langle x, \beta^*\rangle)] \cdot \beta^* := \mu^*\beta^*,$$

which recovers our true signal β* up to scaling. Here we define $\mu^* = E[f'(\langle x, \beta^*\rangle)]$, which is assumed nonzero throughout this work. Hence, Y · S(x) serves as an unbiased estimator of μ*β*, and we can correctly identify the direction of β* by solving a population level optimization problem:
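The identity $E[Y \cdot S(x)] = \mu^*\beta^*$ can be checked numerically in a simple case. The sketch below assumes standard Gaussian covariates (so that $S(x) = x$), the link $f = \sin$, and a noise level of 0.5; these choices, the sample size, and the variable names are illustrative only. Since $\|\beta^*\|_2 = 1$, the index $\langle x, \beta^*\rangle$ is $N(0,1)$ and $\mu^* = E[\cos(Z)] = e^{-1/2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200_000, 5

# Unit-norm signal, so that <x, beta*> ~ N(0, 1) under x ~ N(0, I_p).
beta_star = np.ones(p) / np.sqrt(p)

X = rng.standard_normal((n, p))
y = np.sin(X @ beta_star) + 0.5 * rng.standard_normal(n)

# Sample version of E[Y * S(x)], using S(x) = x for the standard Gaussian.
moment = X.T @ y / n

# mu* = E[f'(<x, beta*>)] = E[cos(Z)] = exp(-1/2) for Z ~ N(0, 1).
mu_star = np.exp(-0.5)

print("empirical moment:", np.round(moment, 3))
print("mu* beta*       :", np.round(mu_star * beta_star, 3))
```

The two printed vectors should agree up to Monte Carlo error, even though the link function is never used by the estimator.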

$$\min_{\beta} L(\beta) := \langle \beta, \beta\rangle - 2\langle \beta, E[Y \cdot S(x)]\rangle.$$
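To see why this population problem identifies the direction of β*, one can complete the square; the short derivation below uses only the definitions above.

```latex
\begin{align*}
L(\beta) &= \langle \beta, \beta\rangle - 2\langle \beta, E[Y \cdot S(x)]\rangle
          = \big\|\beta - E[Y \cdot S(x)]\big\|_2^2 - \big\|E[Y \cdot S(x)]\big\|_2^2, \\
\operatorname*{argmin}_{\beta} L(\beta) &= E[Y \cdot S(x)] = \mu^*\beta^* .
\end{align*}
```

The minimizer is therefore a nonzero multiple of β* whenever $\mu^* \neq 0$.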

Since we only have access to finite data, we replace $E[Y \cdot S(x)]$ by its sample version estimator $\frac{1}{n}\sum_{i=1}^n y_i S(x_i)$, and plug the sample-based estimator into the loss function. In a high dimensional SIM given in Definition 2, where the true signal β* is assumed to be sparse, various works [Plan and Vershynin, 2016, Plan et al., 2017, Yang et al., 2017] have shown that the $\ell_1$-regularized estimator $\hat\beta$ given by

$$\hat\beta \in \operatorname*{argmin}_{\beta} L(\beta) := \langle \beta, \beta\rangle - 2\Big\langle \beta, \frac{1}{n}\sum_{i=1}^n y_i S(x_i)\Big\rangle + \lambda\|\beta\|_1 \tag{8}$$

attains the optimal statistical rate of convergence to μ*β*.
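Because the quadratic part of (8) is separable across coordinates, its minimizer is a coordinatewise soft-thresholding of the moment vector at level $\lambda/2$. The following minimal sketch makes this explicit; the function name `l1_estimator` and the toy vector `phi`, which stands in for $\frac{1}{n}\sum_{i=1}^n y_i S(x_i)$, are hypothetical.

```python
import numpy as np

def l1_estimator(phi, lam):
    """Minimizer of <b, b> - 2 <b, phi> + lam * ||b||_1 over b.

    Setting the coordinatewise subgradient 2 b_j - 2 phi_j + lam * sign(b_j)
    to zero gives soft-thresholding of phi at level lam / 2.
    """
    return np.sign(phi) * np.maximum(np.abs(phi) - lam / 2.0, 0.0)

# Toy stand-in for the sample moment (1/n) sum_i y_i S(x_i).
phi = np.array([0.30, -0.05, 0.02, -0.40])
beta_hat = l1_estimator(phi, lam=0.2)
print(beta_hat)
```

Every surviving coordinate is shrunk towards zero by $\lambda/2$, which is one way to see the bias of the $\ell_1$ penalty on strong signals.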

In contrast, instead of imposing an $\ell_1$-norm regularization term, we propose to obtain an estimator by minimizing the loss function L directly, with β re-parameterized using two vectors w and v in $\mathbb{R}^p$. Specifically, we write β as $\beta = w \odot w - v \odot v$ and thus equivalently write the loss function L(β) as L(w, v), which is given by

$$L(w, v) = \langle w \odot w - v \odot v,\; w \odot w - v \odot v\rangle - 2\Big\langle w \odot w - v \odot v,\; \frac{1}{n}\sum_{i=1}^n y_i S(x_i)\Big\rangle. \tag{9}$$

Note that the way of writing β in terms of w and v is not unique. In particular, β has p degrees of freedom but we use 2p parameters to represent β. Thus, by using w and v instead of β, we employ over-parameterization in (9).

We briefly describe our motivation for over-parameterizing β by $w \odot w - v \odot v$. Suppose that β is sparse; an explicit regularization is then to use the $\ell_1$-penalty. Note that $\|\beta\|_1 = \min_{\gamma \odot \delta = \beta}\{\|\gamma\|_2^2 + \|\delta\|_2^2\}/2$, where ⊙ denotes the Hadamard (componentwise) product. Thus, an explicit regularization is to solve $\min_{\gamma,\delta} \sum_{i=1}^n \{Y_i - f(x_i^\top(\gamma \odot \delta))\}^2 + \lambda\{\|\gamma\|_2^2 + \|\delta\|_2^2\}$ for a penalty parameter λ, following the method in Hoff [2017]. To gain understanding of implicit regularization by over-parameterization, we let $w = (\gamma + \delta)/2$ and $v = (\gamma - \delta)/2$. Then $\beta = \gamma \odot \delta = w \odot w - v \odot v$ with 2p new parameters w and v that over-parameterize the problem. This leads to the empirical loss $L(w, v) = \sum_{i=1}^n \{Y_i - f(x_i^\top(w \odot w - v \odot v))\}^2$. Following neural network training, we drop the explicit penalty and run gradient descent to minimize L(w, v).
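Both facts used in this motivation follow from elementary coordinatewise algebra. The sketch below spells them out for a single coordinate $j$; summing the first line over $j$ gives the variational form of the $\ell_1$ norm.

```latex
\begin{align*}
\tfrac{1}{2}\big(\gamma_j^2 + \delta_j^2\big) &\ge |\gamma_j \delta_j| = |\beta_j|,
  \quad \text{with equality iff } |\gamma_j| = |\delta_j| = \sqrt{|\beta_j|}, \\
w_j^2 - v_j^2 &= \tfrac{1}{4}\big[(\gamma_j + \delta_j)^2 - (\gamma_j - \delta_j)^2\big]
  = \gamma_j \delta_j = \beta_j .
\end{align*}
```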

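To be more specific, for the sparse SIM, we propose to construct an estimator of β* by applying gradient descent to L in (9) with respect to w and v, without any explicit regularization. Such an estimator, if it achieves the desired statistical accuracy, demonstrates the efficacy of implicit regularization of gradient descent in over-parameterized sparse SIM. Specifically, the gradient updates for the vector (w, v) for solving (9) are given by

$$w_{t+1} = w_t - \eta\nabla_w L(w_t, v_t) = w_t - \eta\Big(w_t \odot w_t - v_t \odot v_t - \frac{1}{n}\sum_{i=1}^n S(x_i)y_i\Big) \odot w_t, \tag{10}$$
$$v_{t+1} = v_t - \eta\nabla_v L(w_t, v_t) = v_t + \eta\Big(w_t \odot w_t - v_t \odot v_t - \frac{1}{n}\sum_{i=1}^n S(x_i)y_i\Big) \odot v_t. \tag{11}$$

Here η > 0 is the stepsize, into which the constant factor 4 produced by differentiating (9) is absorbed. By the parameterization of β, $\{w_t, v_t\}_{t \ge 0}$ leads to a sequence of estimators $\{\beta_t\}_{t \ge 0}$ given by

$$\beta_{t+1} = w_{t+1} \odot w_{t+1} - v_{t+1} \odot v_{t+1}. \tag{12}$$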
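The updates (10)-(12) can be sketched in a few lines. The example below assumes standard Gaussian covariates (so $S(x) = x$), the link $f = \sin$, no truncation step, and a fixed iteration budget; the function name `hadamard_gd`, the problem sizes, the noise level, and the stopping time are illustrative assumptions rather than the tuned choices of the paper.

```python
import numpy as np

def hadamard_gd(phi, alpha=1e-5, eta=0.01, num_iters=8000):
    """Run updates (10)-(12) on the moment vector phi = (1/n) sum_i y_i S(x_i)."""
    p = phi.shape[0]
    w = np.full(p, alpha)  # w_0 = alpha * 1
    v = np.full(p, alpha)  # v_0 = alpha * 1, hence beta_0 = 0
    for _ in range(num_iters):
        resid = w * w - v * v - phi                    # beta_t - phi
        w, v = w - eta * resid * w, v + eta * resid * v
    return w * w - v * v

rng = np.random.default_rng(0)
n, p, s = 1000, 500, 5

# s-sparse, unit-norm signal supported on the first s coordinates.
beta_star = np.zeros(p)
beta_star[:s] = 1.0 / np.sqrt(s)

X = rng.standard_normal((n, p))
y = np.sin(X @ beta_star) + 0.1 * rng.standard_normal(n)
phi = X.T @ y / n  # S(x) = x for x ~ N(0, I_p)

beta_hat = hadamard_gd(phi)

cosine = beta_hat @ beta_star / np.linalg.norm(beta_hat)
max_off_support = np.abs(beta_hat[s:]).max()
print(f"cosine with beta*: {cosine:.4f}, max |beta_j| off support: {max_off_support:.2e}")
```

Stopping after a finite number of iterations matters in this sketch: coordinates with a larger moment grow exponentially faster from the small initialization, so the support coordinates reach their limits while the remaining ones are still close to zero.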

Meanwhile, in terms of choosing initial values, since the zero vector is a stationary point of the algorithm, we cannot set the initial values of w and v to the zero vector. To utilize the structure of β*, ideally we would like to initialize w and v such that they share the same sparsity pattern as β*. That is, we would like to set the entries in the support of β* to nonzero values, and set those outside of the support to zero. However, such an initialization scheme is infeasible since the support of β* is unknown. Instead, we initialize $w_0$ and $v_0$ as $w_0 = v_0 = \alpha \cdot 1_{p\times 1}$, where α > 0 is a small constant and $1_{p\times 1}$ is an all-one vector in $\mathbb{R}^p$. By setting $w_0 = v_0$, we equivalently set $\beta_0$ to the zero vector. More importantly, such a construction provides a good compromise: zero components, which are the majority under the sparsity assumption, get nearly zero initializations, and nonzero components get nonzero initializations. Even though we initialize every component at the same value, the nonzero components move quickly to their stationary values, while zero components remain small. This is how over-parameterization differentiates active components from inactive components. We illustrate this by a simulation experiment.

A simulation study.

In this simulation, we fix sample size n = 1000, dimension p = 2000, and number of non-zero entries s = 5. Let $S := \{i: |\beta_i^*| > 0\}$. The responses $\{y_i\}_{i=1}^n$ are generated from $y_i = f(\langle x_i, \beta^*\rangle) + \epsilon_i$, $i \in [n]$, with link functions $f_1(x) = x$ (linear regression) and $f_2(x) = \sin(x)$. Here we assume β* is s-sparse with $\beta_i^* = 1/\sqrt{s}$, $i \in S$, and $\{x_i\}_{i=1}^n$ are standard Gaussian random vectors. We over-parameterize β as $w \odot w - v \odot v$ and set $w_0 = v_0 = 10^{-5} \cdot 1_{p\times 1}$. Then we update w, v and β according to equations (10), (11), and (12) with stepsize η = 0.01. The evolution of the distance between our unnormalized iterates $\beta_t$ and μ*β*, the trajectories of $\beta_{j,t}$ for $j \in S$, and $\max_{j \in S^c}|\beta_{j,t}|$ are depicted in Figures 1 and 2.

Figure 1: With link function f(x) = x, (a) characterizes the evolution of the distance $\|\beta_t - \mu^*\beta^*\|_2^2$ against iteration number t; (b) depicts the trajectories $\beta_{j,t}$ ($j \in S$) for the five nonzero components; and (c) presents the trajectory of $\max_{j \in S^c}|\beta_{j,t}|$.

Figure 2: With link function f(x) = sin(x), similar to Figure 1, (a) characterizes the evolution of the distance $\|\beta_t - \mu^*\beta^*\|_2^2$ against iteration number t; (b) depicts the trajectories $\beta_{j,t}$ ($j \in S$) for the five nonzero components; and (c) presents the trajectory of $\max_{j \in S^c}|\beta_{j,t}|$.

From the simulation results given in Figure 1-(a) and Figure 2-(a), we notice that there exists a time interval during which we can nearly recover μ*β*. From plots (b) in Figures 1 and 2, we can see that, with over-parameterization, the five nonzero components all increase rapidly and converge quickly to their stationary points. Meanwhile, the maximum estimation error for the inactive components, represented by $\|\beta_{S^c,t}\|_\infty$, still remains small, as shown in Figure 1-(c) and Figure 2-(c). In other words, running gradient descent with respect to the over-parameterized parameters helps us distinguish non-zero components from zero components, while applying gradient descent to the ordinary loss cannot.

It is worth noting that, with over-parameterization, there are $\Omega(2^p)$ stationary points of L satisfying $\nabla_w L(w, v) = \nabla_v L(w, v) = 0_{p\times 1}$, where $0_{p\times 1}$ is the zero vector. To see this, for any subset I ⊆ [p], we define vectors $\bar{w}$ and $\bar{v}$ as follows. For any $j \notin I$, we set the j-th entries of $\bar{w}$ and $\bar{v}$ to zero. Meanwhile, for any $j \in I$, we choose $\bar{w}_j$ and $\bar{v}_j$ such that $\bar{w}_j^2 - \bar{v}_j^2 = n^{-1}\sum_{i=1}^n S(x_i)_j y_i$, where $\bar{w}_j$, $\bar{v}_j$, and $S(x_i)_j$ are the j-th entries of $\bar{w}$, $\bar{v}$, and $S(x_i)$, respectively. By direct computation, it can be shown that $(\bar{w}, \bar{v})$ is a stationary point of L, and thus there are at least $2^p$ stationary points. However, our numerical results demonstrate that not all of these stationary points are likely to be found by the gradient descent algorithm: gradient descent favors the stationary points that correctly recover μ*β*. Such an intriguing observation captures the implicit regularization induced by the optimization algorithm and over-parameterization.
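The stationary-point construction can be verified numerically. In the sketch below, `phi` is a random stand-in for the moment vector $n^{-1}\sum_{i=1}^n S(x_i) y_i$, the subset is drawn at random, and coordinates in the subset are matched to `phi` while the others are zeroed out; the names `grad_L`, `in_I`, `w_bar`, and `v_bar` are hypothetical. The gradient is that of the loss in (9), including its constant factor 4.

```python
import numpy as np

def grad_L(w, v, phi):
    """Gradient of L(w, v) = <b, b> - 2 <b, phi> with b = w * w - v * v."""
    resid = w * w - v * v - phi
    return 4.0 * resid * w, -4.0 * resid * v

rng = np.random.default_rng(1)
p = 8
phi = rng.standard_normal(p)   # stand-in for (1/n) sum_i y_i S(x_i)
in_I = rng.random(p) < 0.5     # indicator of a random subset I of [p]

# Inside I: pick (w_j, v_j) with w_j^2 - v_j^2 = phi_j. Outside I: both zero.
w_bar = np.where(in_I & (phi > 0), np.sqrt(np.abs(phi)), 0.0)
v_bar = np.where(in_I & (phi < 0), np.sqrt(np.abs(phi)), 0.0)

gw, gv = grad_L(w_bar, v_bar, phi)
print("max |grad_w|:", np.abs(gw).max(), " max |grad_v|:", np.abs(gv).max())
```

Every subset yields a point with zero gradient, so there is one such stationary point for each of the $2^p$ subsets.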

3.1. Gaussian Design

In this subsection, we discuss the over-parameterized SIM with Gaussian covariates. We assume that the distribution of $x$ in (6) is $N(\mu, \Sigma)$, where both $\mu$ and $\Sigma$ are assumed known. Moreover, only in this subsection, we slightly modify the identifiability condition in Definition 2 from $\|\beta^*\|_2 = 1$ to $\|\Sigma^{1/2}\beta^*\|_2 = 1$.

3.1.1. Theoretical Results for Gaussian Covariates

We first introduce a structural assumption on the SIM.

Assumption 1. Assume that $\mu^* := \mathbb{E}[f'(\langle x, \beta^*\rangle)] \neq 0$ is a constant and the following two conditions hold.

  (a) The covariance matrix $\Sigma$ is positive-definite and has bounded spectral norm. To be more specific, there exist constants $C_{\min}$ and $C_{\max}$ such that $C_{\min} I_{p\times p} \preceq \Sigma \preceq C_{\max} I_{p\times p}$ holds, where $I_{p\times p}$ is the identity matrix.

  (b) Both $\{f(\langle x_i, \beta^*\rangle)\}_{i=1}^n$ and $\{\epsilon_i\}_{i=1}^n$ are i.i.d. sub-Gaussian random variables, with the sub-Gaussian norms denoted by $\|f\|_{\psi_2} = O(1)$ and $\sigma = O(1)$, respectively. Here we let $\|f\|_{\psi_2}$ denote the sub-Gaussian norm of $f(\langle x_i, \beta^*\rangle)$. In addition, we further assume that $|\mu^*|/\|f\|_{\psi_2} = \Theta(1)$ and $|\mu^*|/\sigma = \Omega(1)$.

The score function for the Gaussian distribution $N(\mu, \Sigma)$ is $S(x) = \Sigma^{-1}(x - \mu)$, and Assumption 1-(a) makes the Gaussian covariates non-degenerate. Assumption 1-(b) enables the empirical estimator $n^{-1}\sum_{i=1}^n y_i S(x_i)$ to concentrate around its expectation $\mu^*\beta^*$, and also sets a lower bound on the signal-to-noise ratio $|\mu^*|/\sigma$. Note that this assumption is quite standard and is satisfied by a broad class of models as long as there exists a lower bound on the signal-to-noise ratio, including models with link functions $f(x) = x$, $\sin(x)$, $\tanh(x)$, etc. In addition, in §3.2, the assumption that both $f(\langle x, \beta^*\rangle)$ and the noise $\epsilon$ are sub-Gaussian random variables will be further relaxed to simply assuming that they have bounded finite moments, with possibly heavy-tailed distributions.

We present the details of the proposed method for the Gaussian case in Algorithm 1. In the following, we present the statistical rates of convergence for the estimator constructed by Algorithm 1. Let us divide the support set $S = \{i : |\beta_i^*| > 0\}$ into $S_0 = \{i : |\beta_i^*| \ge C_s\sqrt{\log p/n}\}$ and $S_1 = \{i : 0 < |\beta_i^*| < C_s\sqrt{\log p/n}\}$, which correspond to the sets of strong and weak signals, respectively. Here $C_s$ is an absolute constant. We let $s_0$ and $s_1$ be the cardinalities of $S_0$ and $S_1$, respectively. In addition, we let $s_m = \min_{i \in S_0}|\beta_i^*|$ be the smallest magnitude among the strong signals.
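To make the procedure concrete, the following is a minimal sketch of regularization-free gradient descent in the vector case, not a reproduction of Algorithm 1. It assumes the vector updates mirror the matrix updates (17)-(18), namely $w_{t+1} = w_t - \eta\,(w_t \odot w_t - v_t \odot v_t - \Phi) \odot w_t$ and $v_{t+1} = v_t + \eta\,(\cdot) \odot v_t$ with $w_0 = v_0 = \alpha \mathbf{1}$, and it takes $\mu = 0$, $\Sigma = I$ so that $S(x) = x$. The function name `overparam_gd` and the values of `alpha`, `eta`, `n_steps`, `n`, and `p` are illustrative assumptions.

```python
import numpy as np

def overparam_gd(phi, alpha, eta, n_steps):
    """Gradient descent on the over-parameterized loss, with beta = w*w - v*v.

    phi is the empirical moment n^{-1} sum_i y_i S(x_i); no penalty is used.
    """
    p = phi.shape[0]
    w = np.full(p, alpha)
    v = np.full(p, alpha)
    for _ in range(n_steps):
        r = w * w - v * v - phi          # residual of the current iterate
        w, v = w - eta * r * w, v + eta * r * v
    return w * w - v * v

rng = np.random.default_rng(1)
n, p = 500, 200
beta_star = np.zeros(p)
beta_star[:3] = 1.0 / np.sqrt(3.0)       # ||beta*||_2 = 1, three strong signals
X = rng.normal(size=(n, p))              # N(0, I) design, so S(x) = x
y = np.sin(X @ beta_star) + 0.5 * rng.normal(size=n)

phi = X.T @ y / n
mu_star = np.exp(-0.5)                   # E[cos(Z)] for Z ~ N(0, 1)
beta_hat = overparam_gd(phi, alpha=1e-5, eta=0.05, n_steps=1000)

err = np.sum((beta_hat - mu_star * beta_star) ** 2)
off_support = np.abs(beta_hat[3:]).max()
print(err, off_support)
```

With a small initialization, the three signal coordinates approach $\mu^*\beta^*$ long before the remaining coordinates move away from zero, which is the separation that early stopping exploits.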

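Theorem 1. Apart from Assumption 1, if we further let our initial value $\alpha$ satisfy $0 < \alpha \le M_0^2/p$ and set the stepsize $\eta$ as $0 < \eta \le 1/(12(|\mu^*| + M_0))$ in Algorithm 1, with $M_0$ being a constant proportional to $\max\{\|f\|_{\psi_2}, \sigma\}$, there exist absolute constants $a_1, a_2 > 0$ such that, with probability at least $1 - 2p^{-1} - 2n^{-2}$, we have

$$\|\beta_{T_1} - \mu^*\beta^*\|_2^2 \lesssim \frac{s_0 \log n}{n} + \frac{s_1 \log p}{n},$$

for all $T_1 \in \big[a_1 \log(1/\alpha)/(\eta(|\mu^*| s_m - M_0\sqrt{\log p/n})),\ a_2 \log(1/\alpha)\sqrt{n/\log p}/(\eta M_0)\big]$. Meanwhile, the statistical rate of convergence for the normalized iterates is given by

$$\left\| \frac{\beta_{T_1}}{\|\Sigma^{1/2}\beta_{T_1}\|_2} - \frac{\mu^*\beta^*}{|\mu^*|} \right\|_2^2 \lesssim \frac{s_0 \log n}{n} + \frac{s_1 \log p}{n}.$$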

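Theorem 2. (Variable Selection Consistency) Under the setting of Theorem 1, for all

$$T_1 \in \big[a_1 \log(1/\alpha)/(\eta(|\mu^*| s_m - M_0\sqrt{\log p/n})),\ a_2 \log(1/\alpha)\sqrt{n/\log p}/(\eta M_0)\big],$$

we let $[\tilde\beta_{T_1}]_i = [\beta_{T_1}]_i \cdot \mathbb{I}\{|[\beta_{T_1}]_i| \ge \lambda\}$ for all $i \in [p]$. Then, with probability at least $1 - 2p^{-1} - 2n^{-2}$, for all $\lambda \in [\alpha, (C_s|\mu^*| - 2M_0)\sqrt{\log p/n}]$, we have $\mathrm{supp}(\tilde\beta_{T_1}) \subseteq \mathrm{supp}(\beta^*)$. Moreover, when there exist only strong signals in $S_0$, we further have $\mathrm{supp}(\tilde\beta_{T_1}) = \mathrm{supp}(\beta^*)$ and $\mathrm{sign}(\tilde\beta_{T_1}) = \mathrm{sign}(\beta^*)$.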
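The truncation step in Theorem 2 is a plain hard threshold. A small sketch follows, with a made-up iterate `beta_t` and a hypothetical threshold `lam` chosen only for illustration; the function name `hard_threshold` is an assumption.

```python
import numpy as np

def hard_threshold(beta, lam):
    """Keep entries with |beta_i| >= lam and set the rest to zero."""
    return np.where(np.abs(beta) >= lam, beta, 0.0)

# Hypothetical iterate: three sizeable entries and three near-zero ones.
beta_t = np.array([0.58, -0.61, 3e-6, -2e-5, 0.0, 0.55])
beta_tilde = hard_threshold(beta_t, lam=1e-3)

support = np.flatnonzero(beta_tilde)   # estimated support
signs = np.sign(beta_tilde[support])   # estimated signs on the support
print(support, signs)
```

Because the inactive coordinates stay close to their initial scale, any threshold between that scale and the strong-signal level selects the same support, which is the wide admissible range for $\lambda$ described in the theorem.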

Theorem 1 shows that if we just have strong signals, then with high probability, for any $T_1 \in \big[a_1 \log(1/\alpha)/(\eta(|\mu^*| s_m - M_0\sqrt{\log p/n})),\ a_2 \log(1/\alpha)\sqrt{n/\log p}/(\eta M_0)\big]$, we get the oracle statistical rate $O(\sqrt{s \log n/n})$ in terms of the $\ell_2$-norm, which is independent of the ambient dimension $p$. Besides, when $\beta^*$ also contains weak signals, we achieve the $O(\sqrt{s \log p/n})$ statistical rate in terms of the $\ell_2$-norm, where $s$ is the sparsity of $\beta^*$. Such a statistical rate matches the minimax rate of sparse linear regression [Raskutti et al., 2011] and is thus minimax optimal. Notice that the oracle rate is achievable via explicit regularization using folded concave penalties [Fan et al., 2014] such as SCAD [Fan and Li, 2001] and MCP [Zhang et al., 2010]. Thus, Theorem 1 shows that, with over-parameterization, the implicit regularization of gradient descent has the same effect as explicitly adding a folded concave penalty function to the loss function in (9).

Furthermore, compared with Plan and Vershynin [2016] and Plan et al. [2017], which study the high-dimensional SIM with $\ell_1$-regularization, our method avoids the bias brought by the $\ell_1$-penalty and attains the oracle statistical rate, thanks to the implicit regularization phenomenon. Another advantage of our method over regularized methods is shown in Theorem 2: by properly truncating $\beta_{T_1}$ when $T_1$ falls in the optimal time interval, we are able to recover the support of $\beta^*$ with high probability. Compared with the existing literature on support recovery via explicit regularization for the single index model [Neykov et al., 2016], our method offers a wider range for choosing the tuning parameter $\lambda$ with a known left boundary $\alpha$, instead of only using $\lambda = \Theta(\sqrt{\log p/n})$. This efficiently reduces the false discovery rate; see §D.1 for more details. Last but not least, as we only need to run gradient descent, our algorithm is easier to parallelize than regularized methods, since the gradient information can be efficiently transferred among different machines. The use of implicit regularization allows our methodology to be generalized to large-scale problems easily [McMahan et al., 2017, Richards and Rebeschini, 2020, Richards et al., 2020]. The detailed discussions are given in §C.5.

Theorem 1 and Theorem 2 generalize the results in Zhao et al. [2019] and Vaškevičius et al. [2019] for the linear model to high-dimensional SIMs. In addition, to satisfy the RIP condition, their sample complexity is at least $O(s^2 \log p)$ if their covariate $x$ follows the Gaussian distribution. In contrast, by using the loss function in (9) motivated by Stein's identity [Stein et al., 1972, 2004], the RIP condition is unnecessary in our analysis. Instead, our theory only requires that $n^{-1}\sum_{i=1}^n S(x_i) y_i$ concentrates at a fast rate. As a result, our sample complexity is $O(s \log p)$ for $\ell_2$-norm consistency, which is better than $O(s^2 \log p)$.

The ideas behind the proofs of Theorem 1 and Theorem 2 are as follows. First, we are able to control the strength of the error component, denoted by $\beta_t \odot 1_{S^c}$, at the same order as the square root of its initial value until $O(\log(1/\alpha)\sqrt{n/\log p}/(\eta M_0))$ steps. This gives us the right boundary of the stopping time $T_1$. Meanwhile, every entry of the strong signal part $\beta_t \odot 1_{S_0}$ grows at an exponential rate to $\epsilon = O(\sqrt{\log n/n})$ accuracy around $\mu^*\beta^* \odot 1_{S_0}$ within $O(\log(1/\alpha)/(\eta(|\mu^*| s_m - M_0\sqrt{\log p/n})))$ steps, which gives us the left boundary of the stopping time $T_1$. Finally, we prove that, for weak signals, their strengths will not exceed $O(\sqrt{\log p/n})$ for all steps as long as we properly choose the stepsize. Thus, by letting the stopping time $T_1$ be in the interval given in Theorem 1, we obtain a converged signal component and a well-controlled error component. The final statistical rates are obtained by combining the results on the active and inactive components. Moreover, the conclusion of Theorem 2 holds by truncating $\beta_t$ properly, since we are able to control the error component of $\beta_t$ uniformly as mentioned above. See Appendix §E.1 for the details. As shown in the proof, with small initialization and an over-parameterized loss function, the signal component converges rapidly to the true signal, while the error component grows at a relatively slow pace. Thus, gradient descent rapidly isolates the signal components from the noise and, with a proper stopping time, finds a near-sparse solution with high statistical accuracy. In this sense, with proper initialization, over-parameterization plays the role of an implicit regularization by favoring approximately sparse saddle points of the loss function in (9).

Finally, we remark that Theorem 1 establishes optimal statistical rates for the estimator $\beta_{T_1}$, where $T_1$ is any stopping time that belongs to the interval given in Theorem 1. However, in practice, such an interval is infeasible to compute as it depends on unknown constants. To make the proposed method practical, in the following, we introduce a method for selecting a proper stopping time $T_1$.

3.1.2. Choosing the Stopping Time T1

We split the dataset into training data and testing data. We utilize the training data to implement Algorithm 1 and get the estimator $\beta_t$ as well as the value of the training loss (9) at step $t$. We notice that $\beta_t$ varies slowly inside the optimal time interval specified in Theorem 1, so that the fluctuation of the training loss (9) there is smaller than a threshold. Based on that, we choose $m$ testing points on the flattened part of the training loss curve (9) and denote their corresponding iteration numbers by $\{t_j\}$, $j \in [m]$. For each $j \in [m]$, we then reuse the training data and the normalized estimator $\beta_{t_j}/\|\Sigma^{1/2}\beta_{t_j}\|_2$ to fit the link function $f$. Let the obtained estimator be $\hat f_j$. For the testing dataset, we perform out-of-sample prediction and get $m$ prediction losses:

$$l_j = \frac{1}{n_{\mathrm{test}}}\sum_{i=1}^{n_{\mathrm{test}}} \Big[Y_i - \hat f_j\big(\langle x_i, \beta_{t_j}/\|\Sigma^{1/2}\beta_{t_j}\|_2\rangle\big)\Big]^2, \qquad j \in [m].$$

Next, we choose $T_1$ as $t_{j^*}$, where $j^* = \mathrm{argmin}_{j \in [m]}\, l_j$.
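The selection rule can be sketched as follows. This is an illustration under simplifying assumptions rather than the authors' implementation: it takes $\Sigma = I$ (so the normalization is the Euclidean norm), treats the candidate iterations as given instead of reading them off the flat part of the training loss, and uses a simple box-kernel local average as the link estimator. The names `box_kernel_predict` and `select_stopping_time`, the bandwidth `h`, and the toy `path` are all hypothetical.

```python
import numpy as np

def box_kernel_predict(z_train, y_train, z_new, h):
    """Local average of y_train over training points with |z_new - z_train| <= h."""
    mask = (np.abs(z_new[:, None] - z_train[None, :]) <= h).astype(float)
    counts = mask.sum(axis=1)
    sums = mask @ y_train
    # Points with no neighbors get prediction 0.
    return np.divide(sums, counts, out=np.zeros(len(z_new)), where=counts > 0)

def select_stopping_time(candidates, path, X_tr, y_tr, X_te, y_te, h):
    """Return the candidate iteration with the smallest held-out prediction loss."""
    losses = []
    for t in candidates:
        b = path[t] / np.linalg.norm(path[t])        # normalized iterate (Sigma = I)
        pred = box_kernel_predict(X_tr @ b, y_tr, X_te @ b, h)
        losses.append(float(np.mean((y_te - pred) ** 2)))
    return candidates[int(np.argmin(losses))], losses

rng = np.random.default_rng(2)
p, n_tr, n_te = 5, 400, 200
beta_star = np.eye(p)[0]
X_tr, X_te = rng.normal(size=(n_tr, p)), rng.normal(size=(n_te, p))
y_tr = np.sin(2 * X_tr @ beta_star) + 0.1 * rng.normal(size=n_tr)
y_te = np.sin(2 * X_te @ beta_star) + 0.1 * rng.normal(size=n_te)

# Hypothetical iterates: a wrong direction, a scaled copy of beta*, a mixed direction.
path = np.array([np.eye(p)[1], 0.6 * beta_star, np.eye(p)[0] + np.eye(p)[1]])
t_hat, losses = select_stopping_time([0, 1, 2], path, X_tr, y_tr, X_te, y_te, h=0.2)
print(t_hat, losses)
```

The candidate whose direction matches $\beta^*$ yields the smallest out-of-sample loss, since only that projection retains the information needed to fit the link.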

We remark that each $\hat f_j$ can be obtained by any nonparametric regression method. To showcase our method, in the following, we apply univariate kernel regression to obtain each $\hat f_j$ and establish its theoretical guarantee.

3.1.3. Prediction Risk

We now consider estimating the nonparametric component and the prediction risk. Suppose we are given an estimator $\hat\beta$ of $\beta^*$ and $n$ i.i.d. observations $\{y_i, x_i\}_{i=1}^n$ of the model. For simplicity of the technical analysis, we assume that $\hat\beta$ is independent of $\{y_i, x_i\}_{i=1}^n$, which can be achieved by data-splitting. Moreover, we assume that $\hat\beta$ is an estimator of $\beta^*$ such that

$$\|\hat\beta - \beta^*\|_2 = o(n^{-1/3}), \qquad \|\Sigma^{1/2}\hat\beta\|_2 = 1, \qquad \text{and} \qquad \|\Sigma^{1/2}\beta^*\|_2 = 1. \tag{13}$$

Our goal is to construct an estimate of the regression function $f(\langle \cdot, \beta^*\rangle)$ based on $\hat\beta$ and $\{y_i, x_i\}_{i=1}^n$.

Note that, when $\beta^*$ is known, we can directly estimate $f$ based on $y_i$ and $Z_i^* := x_i^\top\beta^*$, $i \in [n]$, via standard nonparametric regression. When $\hat\beta$ is accurate, a direct idea is to replace $Z_i^*$ by $Z_i := x_i^\top\hat\beta$ and follow a similar route. For a new observation $x$, we define $Z := x^\top\hat\beta$ and $Z^* := x^\top\beta^*$, respectively.

To predict $Y$, we estimate the function $g(z)$ using kernel regression with data $\{(y_i, x_i^\top\hat\beta)\}_{i=1}^n$. Specifically, we let $K_h(u) := 1/h \cdot K(u/h)$, in which $K: \mathbb{R} \to \mathbb{R}$ is a kernel function with $K(u) = \mathbb{I}\{|u| \le 1\}$ and $h$ is a bandwidth. By the definitions of $Z^*$, $Z$, and $Z_i$, $i \in [n]$, given above, the prediction function $\hat g(Z)$ is defined as

$$\hat g(Z) = \begin{cases} \dfrac{\sum_{i=1}^n y_i K_h(Z - Z_i)}{\sum_{i=1}^n K_h(Z - Z_i)}, & |Z - \mu^\top\hat\beta| \le R, \\ 0, & \text{otherwise}, \end{cases} \tag{14}$$

where we follow the convention that $0/0 = 0$. In what follows, we consider the $\ell_2$-prediction risk of $\hat g$, which is given by

$$\mathbb{E}\big[\{\hat g(\langle x, \hat\beta\rangle) - f(\langle x, \beta^*\rangle)\}^2\big],$$

where the expectation is taken with respect to $x$ and $\{x_i, y_i\}_{i=1}^n$. Before proceeding to the theoretical guarantees, we make the following assumption on the regularity of $f$.

Assumption 2. There exist an $\alpha_1 > 0$ and a constant $C > 0$ such that $|f(x)|, |f'(x)| \le C + |x|^{\alpha_1}$.

To justify Assumption 2, we note that the constraint on $f'(x)$ and $f(x)$ given above is weaker than directly assuming that $f'(x)$ and $f(x)$ are bounded functions. Next, we present Theorem 3, which characterizes the convergence rate of the mean integrated error of our prediction function $\hat g(Z)$.

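Theorem 3. If we set $R = 2\sqrt{\log n}$ and $h \asymp n^{-1/3}$ in (14), under Assumption 2, the $\ell_2$-prediction risk of $\hat g$ defined in (14) satisfies

$$\mathbb{E}\big[\{\hat g(\langle x, \hat\beta\rangle) - f(\langle x, \beta^*\rangle)\}^2\big] \lesssim \frac{\mathrm{polylog}(n)}{n^{2/3}},$$

where $\hat\beta$ is any vector that satisfies (13) and $\mathrm{polylog}(n)$ contains terms that are polynomials of $\log n$.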
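The estimator in (14) with the choices of Theorem 3 can be sketched as below. The function name `g_hat` and its argument `center` (standing for $\mu^\top\hat\beta$) are hypothetical; the projections $Z_i$ are simulated directly as standard normal draws, and the bandwidth constant is set to one, which is an assumption since the theorem only fixes the order $n^{-1/3}$.

```python
import numpy as np

def g_hat(z_new, z_train, y_train, center, h, R):
    """Box-kernel estimator of the form (14): local average if |z - center| <= R, else 0."""
    z_new = np.atleast_1d(np.asarray(z_new, dtype=float))
    # K_h(u) = (1/h) * 1{|u/h| <= 1}
    weights = (np.abs(z_new[:, None] - z_train[None, :]) <= h).astype(float) / h
    num = weights @ y_train
    den = weights.sum(axis=1)
    est = np.divide(num, den, out=np.zeros_like(num), where=den > 0)  # 0/0 := 0
    est[np.abs(z_new - center) > R] = 0.0
    return est

rng = np.random.default_rng(3)
n = 2000
z_train = rng.normal(size=n)                 # plays the role of Z_i = x_i^T beta_hat
y_train = np.tanh(z_train) + 0.1 * rng.normal(size=n)

h = n ** (-1.0 / 3.0)                        # bandwidth of order n^{-1/3}
R = 2.0 * np.sqrt(np.log(n))                 # truncation radius from Theorem 3
est = g_hat([0.0, 1.0, 10.0], z_train, y_train, center=0.0, h=h, R=R)
print(est)
```

The first two evaluations track $\tanh(0)$ and $\tanh(1)$, while the point at $z = 10$ lies outside the radius $R$ and is set to zero by construction.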

It is worth noting that the estimator $\hat\beta = \beta_{T_1}/\|\Sigma^{1/2}\beta_{T_1}\|_2$ constructed in Theorem 1, with any $T_1$ in the optimal time interval given in Theorem 1, satisfies (13). Thus, under such regimes, Theorem 3 also holds. The proof of Theorem 3 is given in §E.3. Note that it is possible to refine the analysis of the prediction risk for $f$ with higher-order derivatives by utilizing higher-order kernels (see Tsybakov [2008]); however, this is not the key message of our paper.

3.2. General Design

In this subsection, we extend our methodology to the setting with covariates generated from a general distribution. Following our discussions at the beginning of §3, ideally we aim to minimize the over-parameterized loss function given in (9). However, when the distribution of $x$ has density $p_0$, the score $S(x)$ can be heavy-tailed, such that $\mathbb{E}[Y S(x)]$ and its empirical counterpart may not be sufficiently close.

To remedy this issue, we modify the loss function in (9) by replacing $y_i$ and $S(x_i)$ with their truncated (Winsorized) versions $\check y_i$ and $\check S(x_i)$, respectively. Specifically, we propose to apply gradient descent to the following modified loss function with respect to $w$ and $v$:

$$\min_{w, v}\ L(w, v) := \langle w \odot w - v \odot v,\ w \odot w - v \odot v\rangle - \frac{2}{n}\sum_{i=1}^n \check y_i \langle w \odot w - v \odot v,\ \check S(x_i)\rangle. \tag{15}$$

Let $\check a \in \mathbb{R}^d$ denote the truncated version of a vector $a \in \mathbb{R}^d$ based on a parameter $\tau$ [Fan et al., 2021b]. That is, its entries are given by $[\check a]_j = [a]_j$ if $|a_j| \le \tau$ and $\tau$ otherwise. By applying elementwise truncation to $\{y_i\}_{i=1}^n$ and $\{S(x_i)\}_{i=1}^n$ in (15), we allow the score $S(x)$ and the response $Y$ to both have heavy-tailed distributions. By choosing a proper threshold $\tau$, such a truncation step ensures that $n^{-1}\sum_{i=1}^n \check y_i \check S(x_i)$ converges to $\mathbb{E}[Y S(x)]$ at a desired rate in the $\ell_\infty$-norm. Compared with Algorithm 1, here we only modify the definition of the loss function. Thus, we defer the details of the proposed algorithm for this setting to Algorithm 3 in §E.5.
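The truncation step can be sketched as follows. The function `truncate` implements the rule exactly as stated above (entries exceeding $\tau$ in magnitude are set to $\tau$); the helper `truncated_moment`, the Student-$t$ stand-ins for the scores and responses, and the value `M = 25.0` are illustrative assumptions. The threshold uses the form $\tau = (M n/\log p)^{1/4}/2$ that appears in Theorem 4.

```python
import numpy as np

def truncate(a, tau):
    """Elementwise truncation as stated in the text: a_j if |a_j| <= tau, else tau."""
    a = np.asarray(a, dtype=float)
    return np.where(np.abs(a) <= tau, a, tau)

def truncated_moment(y, S, tau):
    """n^{-1} sum_i y_check_i * S_check(x_i), with y of shape (n,) and S of shape (n, p)."""
    return truncate(S, tau).T @ truncate(y, tau) / len(y)

rng = np.random.default_rng(4)
n, p = 1000, 50
S = rng.standard_t(df=5, size=(n, p))        # heavy-tailed stand-in for the scores S(x_i)
y = rng.standard_t(df=5, size=n)             # heavy-tailed stand-in for the responses
M = 25.0                                     # hypothetical fourth-moment bound
tau = (M * n / np.log(p)) ** 0.25 / 2.0      # threshold of the form used in Theorem 4
phi_check = truncated_moment(y, S, tau)
print(tau, np.abs(phi_check).max())
```

A sign-preserving variant such as `np.clip(a, -tau, tau)` is a common alternative form of Winsorization; it is mentioned here only as an assumption and is not the rule written in the text.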

Before stating our main theorem, we first present an assumption on the distributions of the covariate and the response variables.

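Assumption 3. Assume there exists a constant $M$ such that

$$\mathbb{E}[Y^4] \le M, \qquad \mathbb{E}[S(x)_j^4] \le M, \qquad \forall j \in [p].$$

Here $S(x)_j$ is the $j$-th entry of $S(x)$. Moreover, recall that we denote $\mu^* = \mathbb{E}[f'(\langle x, \beta^*\rangle)]$. We assume that $\mu^*$ is a nonzero constant such that $M/|\mu^*| = \Theta(1)$.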

Assuming that the fourth moments exist and are bounded is significantly weaker than the sub-Gaussian assumption. Moreover, such an assumption is prevalent in the robust statistics literature [Fan et al., 2021c, 2018, 2019]. Now we are ready to introduce the theoretical results for the setting with general design.

Theorem 4. Under Assumption 3, we set the thresholding parameter $\tau = (M \cdot n/\log p)^{1/4}/2$, let the initialization parameter $\alpha$ satisfy $0 < \alpha \le M_g^2/p$, and set the stepsize $\eta$ such that $0 < \eta \le 1/(12(|\mu^*| + M_g))$ in Algorithm 3 given in §E.5, where $M_g$ is a constant proportional to $M$. There exist absolute constants $a_3, a_4$ such that, with probability at least $1 - 2p^{-2}$,

$$\|\beta_{T_1} - \mu^*\beta^*\|_2^2 \lesssim \frac{s \log p}{n}$$

holds for all $T_1 \in \big[a_3 \log(1/\alpha)/(\eta(|\mu^*| s_m - M_g\sqrt{\log p/n})),\ a_4 \log(1/\alpha)\sqrt{n/\log p}/(\eta M_g)\big]$. Here $s$ is the cardinality of the support set $S$ and $s_m = \min_{i \in S_0}|\beta_i^*|$, where $S_0 = \{i : |\beta_i^*| \ge C_s\sqrt{\log p/n}\}$ is the set of strong signals. In addition, for the normalized iterates, we further have

$$\left\| \frac{\beta_{T_1}}{\|\beta_{T_1}\|_2} - \frac{\mu^*\beta^*}{|\mu^*|} \right\|_2^2 \lesssim \frac{s \log p}{n},$$

with probability at least $1 - 2p^{-2}$.

Compared with Theorem 1 for the Gaussian design, here we achieve the $O(\sqrt{s \log p/n})$ statistical rate of convergence in terms of the $\ell_2$-norm. These rates are the same as those achieved by adding an $\ell_1$-norm regularization explicitly [Plan and Vershynin, 2016, Plan et al., 2017, Yang et al., 2017] and are minimax optimal [Raskutti et al., 2011]. Moreover, we note that here $S(x)$ and $Y$ can both be heavy-tailed, and our truncation procedure successfully tackles such a challenge without sacrificing the statistical rates. Similar to the Gaussian case, here $C_s$ can be set as a sufficiently large absolute constant, and the statistical rates established in Theorem 4 hold for all such choices of $C_s$. In addition, for the heavy-tailed case, we also let $[\tilde\beta_{T_1}]_i = [\beta_{T_1}]_i \cdot \mathbb{I}\{|[\beta_{T_1}]_i| \ge \lambda\}$ for all $i \in [p]$. Then, for all $\lambda \in [\alpha, (C_s|\mu^*| - 2M_g)\sqrt{\log p/n}]$, we obtain theoretical guarantees similar to those in Theorem 2.

4. Main Results for Over-Parametrized Low Rank SIM

In this section, we present the results for the over-parameterized low rank matrix SIM introduced in Definition 3 with both standard Gaussian and generally distributed covariates. Similar to the results in §3, here we also focus on the matrix SIM with first-order links, i.e., we assume that $\mu^* = \mathbb{E}[f'(\langle X, \beta^*\rangle)] \neq 0$, where $\beta^*$ is a low rank matrix with rank $r$. Note that we assume that the entries of the covariate $X \in \mathbb{R}^{d\times d}$ are i.i.d. with a univariate density $p_0$. Also recall that we define the score function $S(X) \in \mathbb{R}^{d\times d}$ in (5). Then, similar to the loss function in (9), we consider the loss function

$$L(\beta) := \langle \beta, \beta\rangle - 2\Big\langle \beta,\ \frac{1}{n}\sum_{i=1}^n y_i S(X_i)\Big\rangle,$$

where $\beta \in \mathbb{R}^{d\times d}$ is a symmetric matrix. Hereafter, we rewrite $\beta$ as $WW^\top - VV^\top$, where both $W$ and $V$ are matrices in $\mathbb{R}^{d\times d}$. The intuition behind the re-parameterization $\beta = WW^\top - VV^\top$ is as follows. Any (low rank) symmetric matrix can be written as the difference of two positive semidefinite matrices, namely $WW^\top - VV^\top$ with $W, V \in \mathbb{R}^{d\times d}$. Re-parameterizing the symmetric matrix this way is a generalization of re-parameterizing its eigenvalues by Hadamard products. Thus, this can be regarded as an extension of the re-parameterization mechanism from the vector case to the spectral domain. With such an over-parameterization, we propose to estimate $\beta^*$ by applying gradient descent to the loss function

$$L(W, V) := \langle WW^\top - VV^\top,\ WW^\top - VV^\top\rangle - 2\Big\langle WW^\top - VV^\top,\ \frac{1}{n}\sum_{i=1}^n y_i S(X_i)\Big\rangle. \tag{16}$$
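The claim that any symmetric matrix can be written as $WW^\top - VV^\top$ can be verified with an eigendecomposition. In the sketch below, the function name `psd_difference` and the test matrix are illustrative assumptions; the split places the positive eigenvalues in $W$ and the negative ones in $V$.

```python
import numpy as np

def psd_difference(beta):
    """Split a symmetric matrix into W W^T - V V^T using its eigendecomposition."""
    evals, U = np.linalg.eigh(beta)
    W = U * np.sqrt(np.maximum(evals, 0.0))    # scales column j of U by sqrt(max(e_j, 0))
    V = U * np.sqrt(np.maximum(-evals, 0.0))   # scales column j of U by sqrt(max(-e_j, 0))
    return W, V

rng = np.random.default_rng(8)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
# Rank-two symmetric matrix with one positive and one negative eigenvalue.
beta = 0.8 * np.outer(Q[:, 0], Q[:, 0]) - 0.6 * np.outer(Q[:, 1], Q[:, 1])

W, V = psd_difference(beta)
print(np.abs(W @ W.T - V @ V.T - beta).max())
```

This decomposition is not unique; it is shown only to illustrate that the parameterization loses no symmetric matrices.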

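Since the rank of $\beta^*$ is unknown, we initialize $W_0$ and $V_0$ as $W_0 = V_0 = \alpha I_{d\times d}$ for a small $\alpha > 0$ and construct a sequence of iterates $\{W_t, V_t, \beta_t\}_{t \ge 0}$ via the gradient descent method as follows:

$$W_{t+1} = W_t - \eta\Big(W_t W_t^\top - V_t V_t^\top - \frac{1}{2n}\sum_{i=1}^n S(X_i) y_i - \frac{1}{2n}\sum_{i=1}^n S(X_i)^\top y_i\Big) W_t, \tag{17}$$

$$V_{t+1} = V_t + \eta\Big(W_t W_t^\top - V_t V_t^\top - \frac{1}{2n}\sum_{i=1}^n S(X_i) y_i - \frac{1}{2n}\sum_{i=1}^n S(X_i)^\top y_i\Big) V_t, \qquad \beta_{t+1} = W_t W_t^\top - V_t V_t^\top, \tag{18}$$

where $\eta$ in (17) and (18) is the stepsize. Note that here the algorithm does not impose any explicit regularization. In the rest of this section, we show that such a procedure yields an estimator of the true parameter $\beta^*$ with near-optimal statistical rates of convergence.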
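The updates (17)-(18) can be sketched in a few lines. The example assumes i.i.d. $N(0,1)$ covariate entries (so $S(X) = X$), the identity link (so $\mu^* = 1$), and a rank-one $\beta^*$; the function name `matrix_overparam_gd` and the values of `alpha`, `eta`, `n_steps`, `d`, and `n` are illustrative assumptions. For simplicity the function returns $WW^\top - VV^\top$ at the final iterate, a one-step index shift relative to the definition of $\beta_{t+1}$ in (18).

```python
import numpy as np

def matrix_overparam_gd(Phi, alpha, eta, n_steps):
    """Gradient descent of the form (17)-(18) on beta = W W^T - V V^T, no penalty.

    Phi is the empirical moment n^{-1} sum_i y_i S(X_i); it is symmetrized as in (17).
    """
    d = Phi.shape[0]
    Phi_sym = 0.5 * (Phi + Phi.T)
    W = alpha * np.eye(d)
    V = alpha * np.eye(d)
    for _ in range(n_steps):
        G = W @ W.T - V @ V.T - Phi_sym
        W, V = W - eta * G @ W, V + eta * G @ V
    return W @ W.T - V @ V.T

rng = np.random.default_rng(5)
d, n = 10, 2000
u = rng.normal(size=d)
u /= np.linalg.norm(u)
beta_star = np.outer(u, u)                   # rank one, ||beta*||_F = 1
X = rng.normal(size=(n, d, d))               # i.i.d. N(0, 1) entries, so S(X) = X
y = np.einsum("ijk,jk->i", X, beta_star) + 0.5 * rng.normal(size=n)
Phi = np.einsum("i,ijk->jk", y, X) / n       # n^{-1} sum_i y_i S(X_i)

beta_hat = matrix_overparam_gd(Phi, alpha=1e-4, eta=0.05, n_steps=300)
err = np.linalg.norm(beta_hat - beta_star, "fro") ** 2   # mu* = 1 for the identity link
print(err)
```

Since $W_0$ and $V_0$ are multiples of the identity, the iterates share eigenvectors with the symmetrized moment matrix, so the dynamics act on its eigenvalues in the same way the vector updates act on coordinates.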

Similar to the vector case, for theoretical analysis, here we also divide the eigenvalues of $\beta^*$ into different groups by their strengths. We let $r_i^*$, $i \in [d]$, be the $i$-th eigenvalue of $\beta^*$. The support set $R$ of the eigenvalues is defined as $R := \{i : |r_i^*| > 0\}$, whose cardinality is $r$. We then divide the support set $R$ into $R_0 := \{i : |r_i^*| \ge C_{ms}\sqrt{d \log d/n}\}$ and $R_1 := \{i : 0 < |r_i^*| < C_{ms}\sqrt{d \log d/n}\}$, which correspond to the collections of strong and weak signals, with cardinalities denoted by $r_0$ and $r_1$, respectively. Here $C_{ms} > 0$ is an absolute constant and we have $R = R_0 \cup R_1$. Moreover, we use $r_m$ to denote the minimum strong eigenvalue in magnitude, i.e., $r_m = \min_{i \in R_0}|r_i^*|$.

4.1. Gaussian Design

In this subsection, we focus on the model in (7) with the entries of the covariate $X$ being i.i.d. $N(0, 1)$ random variables. In this case, $S(X_i) = X_i$. This leads to Algorithm 4 given in §F.1, where we replace $S(X_i)$ by $X_i$ in (16)–(18).

Similar to the case in §3.1, here we also impose the following assumption for the function class of the low rank SIM.

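Assumption 4. We assume that $\mu^* = \mathbb{E}[f'(\langle X, \beta^*\rangle)]$ is a nonzero constant. Moreover, we assume that both $\{f(\langle X_i, \beta^*\rangle)\}_{i=1}^n$ and $\{\epsilon_i\}_{i=1}^n$ are i.i.d. sub-Gaussian random variables, with sub-Gaussian norms denoted by $\|f\|_{\psi_2} = O(1)$ and $\sigma = O(1)$, respectively. Here we let $\|f\|_{\psi_2}$ denote the sub-Gaussian norm of $f(\langle X, \beta^*\rangle)$. In addition, we further assume that $|\mu^*|/\|f\|_{\psi_2} = \Theta(1)$ and $|\mu^*|/\sigma = \Omega(1)$.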

The following theorem establishes the statistical rates of convergence for the estimator constructed by Algorithm 4.

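Theorem 5. We set the initialization parameter $0 < \alpha \le M_m^2/d$ and the stepsize $0 < \eta \le 1/[12(|\mu^*| + M_m)]$ in Algorithm 4, where $M_m$ is a constant proportional to $\max\{\|f\|_{\psi_2}, \sigma\}$. Under Assumption 4, there exist constants $a_5, a_6$ such that, with probability at least $1 - 1/(2d) - 3/n^2$, we have

$$\|\beta_{T_1} - \mu^*\beta^*\|_F^2 \lesssim \frac{rd \log d}{n}$$

for all $T_1 \in \big[a_5 \log(1/\alpha)/(\eta(|\mu^*| r_m - M_m\sqrt{d \log d/n})),\ a_6 \log(1/\alpha)\sqrt{n/(d \log d)}/(\eta M_m)\big]$. Moreover, for the normalized iterates $\beta_t/\|\beta_t\|_F$, we have

$$\left\| \frac{\beta_{T_1}}{\|\beta_{T_1}\|_F} - \frac{\mu^*\beta^*}{|\mu^*|} \right\|_F^2 \lesssim \frac{rd \log d}{n}.$$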

Similar to the vector case given in §3.1, as shown in the proof in Appendix §F, here we require $C_{ms}$ to satisfy $C_{ms} \ge \max\{(a_5/a_6 + 1)M_m/|\mu^*|,\ 2M_m/|\mu^*|\}$ in order to let the strong signals in $R_0$ dominate the noise and to ensure that the interval for $T_1$ exists. The statistical rates hold for all such $C_{ms}$. As shown in Theorem 5, with proper choices of the initialization parameter $\alpha$, the stepsize $\eta$, and the stopping time $T_1$, Algorithm 4 constructs an estimator that achieves near-optimal statistical rates of convergence (up to logarithmic factors compared to the minimax lower bound [Rohde and Tsybakov, 2011]). Notice that the statistical rates established in Theorem 5 are also enjoyed by the M-estimator based on the least-squares loss function with nuclear norm penalty [Plan and Vershynin, 2016, Plan et al., 2017]. Thus, in terms of statistical estimation, applying gradient descent to the over-parameterized loss function in (16) is equivalent to adding a nuclear norm penalty explicitly, hence demonstrating the implicit regularization effect. Besides obtaining the optimal $\ell_2$-statistical rate, we are able to recover the true rank with high probability by properly truncating the eigenvalues of $\beta_{T_1}$ for all $T_1 \in \big[a_5 \log(1/\alpha)/(\eta(|\mu^*| r_m - M_m\sqrt{d \log d/n})),\ a_6 \log(1/\alpha)\sqrt{n/(d \log d)}/(\eta M_m)\big]$. Compared with Lee et al. [2015], which studies rank consistency via $\ell_1$-regularization, we offer a wider range for choosing the tuning parameter with a known left boundary $\alpha$, instead of only setting the nuclear norm tuning parameter $\lambda = \tilde\Theta(\sqrt{rd/n})$.

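Theorem 6. (Rank Consistency) Under the setting of Theorem 5, for all

$$T_1 \in \big[a_5 \log(1/\alpha)/(\eta(|\mu^*| r_m - M_m\sqrt{d \log d/n})),\ a_6 \log(1/\alpha)\sqrt{n/(d \log d)}/(\eta M_m)\big],$$

we let $\tilde\beta_{T_1} = \sum_{i=1}^d u_i u_i^\top \lambda_i(\beta_{T_1})\, \mathbb{I}\{|\lambda_i(\beta_{T_1})| \ge \lambda\}$. Here $u_k$, $k \in [d]$, are the eigenvectors of $\beta_{T_1}$. Then, with probability at least $1 - 2d^{-1} - 3n^{-2}$, for all $\lambda \in [\alpha, (C_{ms}|\mu^*| - 2M_m)\sqrt{d \log d/n}]$, $\tilde\beta_{T_1}$ enjoys the conclusion of Theorem 5, and $\mathrm{rank}(\tilde\beta_{T_1}) \le \mathrm{rank}(\beta^*)$. Moreover, when there exist only strong signals in $R_0$, we further have $\mathrm{rank}(\tilde\beta_{T_1}) = \mathrm{rank}(\beta^*)$.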
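The eigenvalue truncation in Theorem 6 can be sketched as below. The function name `eig_threshold`, the synthetic iterate `beta_t`, and the threshold `lam` are illustrative assumptions.

```python
import numpy as np

def eig_threshold(beta, lam):
    """Keep eigen-components of a symmetric matrix with |eigenvalue| >= lam."""
    evals, U = np.linalg.eigh(beta)
    keep = np.abs(evals) >= lam
    return (U[:, keep] * evals[keep]) @ U[:, keep].T

rng = np.random.default_rng(6)
Q, _ = np.linalg.qr(rng.normal(size=(6, 6)))     # orthonormal columns
# Hypothetical iterate: two sizeable eigenvalues plus tiny residual ones.
evals = np.array([0.8, -0.5, 1e-6, -2e-6, 0.0, 0.0])
beta_t = (Q * evals) @ Q.T

beta_tilde = eig_threshold(beta_t, lam=1e-3)
rank_hat = np.linalg.matrix_rank(beta_tilde, tol=1e-8)
print(rank_hat)
```

The truncation keeps both the positive and the negative strong eigenvalue, which matters here because the signal matrix is only assumed symmetric rather than positive semidefinite.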

Furthermore, our method extends the existing works that focus on designing algorithms and studying the implicit regularization phenomenon in noiseless linear matrix sensing models with positive semidefinite signal matrices [Gunasekar et al., 2017, Li et al., 2018, Arora et al., 2019, Gidel et al., 2019]. Specifically, we allow a more general class of (noisy) models and symmetric signal matrices. Compared with Li et al. [2018], our methodology possesses several strengths, which include achieving low sample complexity ($\tilde O(rd)$ instead of $\tilde O(r^2 d)$), allowing weak signals ($\min_{i \in R}|r_i^*| \ge O((1/n)^{1/2})$ instead of $\min_{i \in R}|r_i^*| \ge O((1/n)^{1/6})$), getting a tighter statistical rate under noisy models ($\tilde O(dr/n)$ instead of $\tilde O(\kappa rd/n)$), and applying to a more general class of noisy statistical models. These strengths are achieved by the use of the score transformation together with a refined trajectory analysis, which involves studying the dynamics of the eigenvalues inside the strong signal set elementwise with multiple stages, instead of only studying the dynamics of the minimum eigenvalue with two stages.

The way of choosing the stopping time $T_1$ in the case of the matrix SIM is almost the same as our method in §3.1.2. The only difference is that here we replace $x^\top\beta^*$ by $\mathrm{tr}(X^\top\beta^*)$. Indeed, as we assume $\|\Sigma^{1/2}\beta^*\|_2 = 1$ in the vector SIM and $\|\beta^*\|_F = 1$ in the matrix version for model identifiability, both $x^\top\beta_t$ and $\mathrm{tr}(X^\top\beta_t)$ follow the standard normal distribution. Thus, our results on the prediction risk in §3.1.3 can be applied here directly.

4.2. General Design

In the rest of this section, we focus on the low rank matrix SIM beyond Gaussian covariates. Hereafter, we assume the entries of $X$ are i.i.d. random variables with a known density function $p_0: \mathbb{R} \to \mathbb{R}$. Recall that, according to the remarks following Definition 1, the score function $S(X) \in \mathbb{R}^{d\times d}$ is defined as

$$S(X)_{j,k} := S(X_{j,k}) = -p_0'(X_{j,k})/p_0(X_{j,k}),$$

where $S(X)_{j,k}$ and $X_{j,k}$ are the $(j, k)$-th entries of $S(X)$ and $X$ for all $j, k \in [d]$. However, similar to the results in §3.2, the entries of $S(X)$ can have heavy-tailed distributions, and thus $n^{-1}\sum_{i=1}^n y_i S(X_i)$ may not converge to its expectation $\mathbb{E}[Y S(X)]$ efficiently in terms of the spectral norm. Here $X_i$ is the $i$-th observation of the covariate $X$. To tackle such a challenge, we employ a shrinkage approach [Catoni et al., 2012, Fan et al., 2021c, Minsker, 2018] to construct a robust estimator of $\mathbb{E}[Y S(X)]$. Specifically, we let

$$\phi(x) = \begin{cases} -\log(1 - x + x^2/2), & x \le 0, \\ \log(1 + x + x^2/2), & x > 0, \end{cases}$$

which is approximately $x$ when $x$ is small and grows at a logarithmic rate for large $x$. The rescaled version $\lambda^{-1}\phi(\lambda x)$ for $\lambda \to 0$ behaves like a soft-winsorizing function, which has been widely used in statistical mean estimation with finite bounded moments [Catoni et al., 2012, Brownlees et al., 2015]. For any matrix $X \in \mathbb{R}^{d\times d}$, we apply spectral decomposition to its Hermitian dilation and obtain

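$$X^* := \begin{bmatrix} 0 & X \\ X^\top & 0 \end{bmatrix} = Q\Sigma^* Q^\top,$$

where $\Sigma^* \in \mathbb{R}^{2d\times 2d}$ is a diagonal matrix. Based on such a decomposition, we define $\tilde X = Q\phi(\Sigma^*)Q^\top$, where $\phi$ applies elementwise to the diagonal of $\Sigma^*$. Then we write $\tilde X$ as a block matrix

$$\tilde X := \begin{bmatrix} \tilde X_{11} & \tilde X_{12} \\ \tilde X_{21} & \tilde X_{22} \end{bmatrix},$$

where each block of $\tilde X$ is in $\mathbb{R}^{d\times d}$. We further define a mapping $\phi_1: \mathbb{R}^{d\times d} \to \mathbb{R}^{d\times d}$ by letting $\phi_1(X) := \tilde X_{12}$, which is a regularized version of $X$. Given data $y_1, X_1$, we finally define $H(\cdot)$ as

$$H(y_1 S(X_1), \kappa) := 1/\kappa \cdot \phi_1(\kappa\, y_1 S(X_1)), \qquad \kappa > 0, \tag{19}$$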

where $\kappa$ is a thresholding parameter that converges to zero. This method is similar in spirit to robustifying the singular values of $X$. Based on the operator $H$ defined in (19), we define a loss function $L(W, V)$ as

$$L(W, V) := \langle WW^\top - VV^\top,\ WW^\top - VV^\top\rangle - \frac{2}{n}\sum_{i=1}^n \big\langle WW^\top - VV^\top,\ H(y_i S(X_i), \kappa)\big\rangle. \tag{20}$$

After over-parameterizing $\beta$ as $WW^\top - VV^\top$, we propose to construct an estimator of $\beta^*$ by applying gradient descent to the loss function in (20) with respect to $W$ and $V$. See Algorithm 5 in §F.5 for the details of the algorithm.
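The shrinkage operator in (19) can be sketched as follows. The function names `phi` and `shrink`, the Student-$t$ test matrix, and the value of `kappa` are illustrative assumptions. The code relies on the fact that the nonzero eigenvalues of the Hermitian dilation are plus and minus the singular values of the input, so applying the odd function $\phi$ to them shrinks each singular value while keeping the singular vectors.

```python
import numpy as np

def phi(x):
    """Influence function from the text, written as sign(x) * log(1 + |x| + x^2 / 2)."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x) + 0.5 * x * x)

def shrink(A, kappa):
    """H(A, kappa) = (1 / kappa) * phi_1(kappa * A), computed via the Hermitian dilation."""
    d1, d2 = A.shape
    dil = np.zeros((d1 + d2, d1 + d2))
    dil[:d1, d1:] = kappa * A
    dil[d1:, :d1] = kappa * A.T
    evals, Q = np.linalg.eigh(dil)
    tilde = (Q * phi(evals)) @ Q.T           # Q phi(Sigma*) Q^T
    return tilde[:d1, d1:] / kappa           # upper-right block, rescaled

rng = np.random.default_rng(7)
A = rng.standard_t(df=5, size=(8, 8))        # heavy-tailed stand-in for y_i * S(X_i)
kappa = 0.1
A_shrunk = shrink(A, kappa)

sv_in = np.linalg.svd(A, compute_uv=False)
sv_out = np.linalg.svd(A_shrunk, compute_uv=False)
print(sv_in[:3], sv_out[:3])
```

Large singular values are compressed logarithmically while small ones are nearly unchanged, and the operator approaches the identity map as `kappa` tends to zero.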

In the following, we present the statistical rates of convergence for the obtained estimator. We first introduce the assumption on $Y$ and $p_0$.

Assumption 5. We assume that both the response variable $Y$ and the entries of $S(X)$ have bounded fourth moments. Specifically, there exists an absolute constant $M$ such that

$$\mathbb{E}[Y^4] \le M, \qquad \mathbb{E}[S(X)_{i,j}^4] \le M, \qquad \forall (i, j) \in [d] \times [d].$$

Moreover, we assume that $\mu^* = \mathbb{E}[f'(\langle X, \beta^*\rangle)]$ is a nonzero constant such that $|\mu^*|/M = \Theta(1)$.

Next, we present the main theorem for low rank matrix SIM.

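Theorem 7. In Algorithm 5, we set the parameter $\kappa$ in (19) as $\kappa = \sqrt{\log(4d)/(ndM)}$ and let the initialization parameter $\alpha$ and the stepsize $\eta$ satisfy $0 < \alpha \le M_{mg}^2/d$ and $0 < \eta \le 1/[12(|\mu^*| + M_{mg})]$, where $M_{mg}$ is a constant proportional to $M$. Then, under Assumption 5, there exist absolute constants $a_7, a_8$ such that, with probability at least $1 - (4d)^{-2}$, we have

$$\|\beta_{T_1} - \mu^*\beta^*\|_F^2 \lesssim \frac{rd \log d}{n},$$

for all $T_1 \in \big[a_7 \log(1/\alpha)/(\eta(|\mu^*| r_m - M_{mg}\sqrt{d \log d/n})),\ a_8 \log(1/\alpha)\sqrt{n/(d \log d)}/(\eta M_{mg})\big]$. Moreover, for the normalized iterates $\beta_t/\|\beta_t\|_F$, we have

$$\left\| \frac{\beta_{T_1}}{\|\beta_{T_1}\|_F} - \frac{\mu^*\beta^*}{|\mu^*|} \right\|_F^2 \lesssim \frac{rd \log d}{n}.$$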

For the low rank matrix SIM, when the hyperparameters of the gradient descent algorithm are properly chosen, we also capture the implicit regularization phenomenon by applying a simple optimization procedure to the over-parameterized loss function with heavy-tailed measurements. Here, applying the thresholding operator $H$ in (19) can also be viewed as a data pre-processing step, which arises from the need to handle heavy-tailed observations. Note that the way of choosing $C_{ms}$ here is similar to that in Theorem 5, in order to ensure the convergence rate and the existence of a time interval, so we omit the details. Note also that the $\ell_2$-statistical rate given in Theorem 7 is minimax optimal up to a logarithmic term [Rohde and Tsybakov, 2011]. Similar results were also obtained by Plan and Vershynin [2016], Yang et al. [2017], Goldstein et al. [2018], and Na et al. [2019] via adding explicit nuclear norm regularization. Thus, in terms of statistical recovery, when employing the thresholding in (19) and over-parameterization, gradient descent enforces implicit regularization that has the same effect as the nuclear norm penalty. In addition, in terms of the rank consistency result for the heavy-tailed case, if we also let $\tilde\beta_{T_1} = \sum_{i=1}^d u_i u_i^\top \lambda_i(\beta_{T_1})\, \mathbb{I}\{|\lambda_i(\beta_{T_1})| \ge \lambda\}$, then for all $\lambda \in [\alpha, (C_{ms}|\mu^*| - 2M_{mg})\sqrt{d \log d/n}]$, we achieve the same results as in Theorem 6.

5. Conclusion

In this paper, we leverage over-parameterization to design regularization-free algorithms for the single index model and provide theoretical guarantees for the induced implicit regularization phenomenon. We consider the case where the link function is unknown, the distribution of the covariates is known a priori, and the signal parameter is either an $s$-sparse vector in $\mathbb{R}^p$ or a rank-$r$ matrix in $\mathbb{R}^{d\times d}$. Using the score function and Stein's identity, we propose an over-parameterized nonlinear least-squares loss function. To handle the possibly heavy-tailed distributions of the score functions and the response variables, we adopt additional truncation techniques that robustify the loss function. For both the vector and matrix SIMs, we construct an estimator of the signal parameter by applying gradient descent to the proposed loss function, without any explicit regularization. We prove that, when initialized near the origin, gradient descent with a small stepsize finds an estimator that enjoys minimax-optimal statistical rates of convergence. Moreover, for the vector SIM with Gaussian design, we further obtain the oracle statistical rates that are independent of the ambient dimension. Furthermore, our experimental results support our theoretical findings and also demonstrate that our methods empirically outperform classical methods with explicit regularization in terms of both the $\ell_2$-statistical rate and variable selection consistency.

Supplementary Material

Supplementary

Acknowledgments

Research supported by the NSF grants DMS-1662139 and DMS-1712591, the ONR grant N00014-19-1-2120, and the NIH grant 2R01-GM072611-16.

Contributor Information

Jianqing Fan, Frederick L. Moore ‘18 Professor of Finance, Professor of Statistics, and Professor of Operations Research and Financial Engineering at Princeton University.

Zhuoran Yang, Ph.D. student at the Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA.

Mengxin Yu, Ph.D. student at the Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA.

References

  1. Arora S, Cohen N, Hu W, and Luo Y. Implicit regularization in deep matrix factorization. In Advances in Neural Information Processing Systems, pages 7411–7422, 2019.
  2. Arulkumaran K, Deisenroth MP, Brundage M, and Bharath AA. Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6):26–38, 2017.
  3. Brownlees C, Joly E, and Lugosi G. Empirical risk minimization for heavy-tailed losses. Annals of Statistics, 43(6):2507–2536, 2015.
  4. Candès EJ. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9–10):589–592, 2008.
  5. Candès EJ, Eldar YC, Strohmer T, and Voroninski V. Phase retrieval via matrix completion. SIAM Review, 57(2):225–251, 2015.
  6. Catoni O. Challenging the empirical mean and empirical variance: A deviation study. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 48(4):1148–1185, 2012.
  7. Dauphin YN, Pascanu R, Gulcehre C, Cho K, Ganguli S, and Bengio Y. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941, 2014.
  8. Donoho DL. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
  9. Fan J and Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
  10. Fan J, Xue L, and Zou H. Strong oracle optimality of folded concave penalized estimation. Annals of Statistics, 42(3):819, 2014.
  11. Fan J, Liu H, and Wang W. Large covariance estimation through elliptical factor models. Annals of Statistics, 46(4):1383–1414, 2018.
  12. Fan J, Wang W, and Zhong Y. Robust covariance estimation for approximate factor models. Journal of Econometrics, 208(1):5–22, 2019.
  13. Fan J, Ma C, and Zhong Y. A selective overview of deep learning. Statistical Science, 36(2):264–290, 2021a.
  14. Fan J, Wang K, Zhong Y, and Zhu Z. Robust high dimensional factor models with applications to statistical machine learning. Statistical Science, 36:303–327, 2021b.
  15. Fan J, Wang W, and Zhu Z. A shrinkage principle for heavy-tailed data: High-dimensional robust low-rank matrix recovery. Annals of Statistics, 49(3):1239–1266, 2021c.
  16. Gidel G, Bach F, and Lacoste-Julien S. Implicit regularization of discrete gradient dynamics in linear neural networks. In Advances in Neural Information Processing Systems, pages 3196–3206, 2019.
  17. Goldstein L and Wei X. Non-Gaussian observations in nonlinear compressed sensing via Stein discrepancies. Information and Inference: A Journal of the IMA, 8(1):125–159, 2019.
  18. Goldstein L, Minsker S, and Wei X. Structured signal recovery from non-linear and heavy-tailed measurements. IEEE Transactions on Information Theory, 64(8):5513–5530, 2018.
  19. Goodfellow I, Bengio Y, and Courville A. Deep learning. MIT Press, 2016.
  20. Gunasekar S, Woodworth BE, Bhojanapalli S, Neyshabur B, and Srebro N. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pages 6151–6159, 2017.
  21. Hoff PD. Lasso, fractional norm and structured sparse estimation using a Hadamard product parametrization. Computational Statistics & Data Analysis, 115:186–198, 2017.
  22. Ke Y, Minsker S, Ren Z, Sun Q, and Zhou W-X. User-friendly covariance estimation for heavy-tailed distributions. Statistical Science, 34(3):454–471, 2019.
  23. Keskar NS, Mudigere D, Nocedal J, Smelyanskiy M, and Tang PTP. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017.
  24. LeCun Y, Bengio Y, and Hinton G. Deep learning. Nature, 521(7553):436–444, 2015.
  25. Lee JD, Sun Y, and Taylor JE. On model selection consistency of regularized M-estimators. Electronic Journal of Statistics, 9(1):608–642, 2015.
  26. Li Y. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.
  27. Li Y, Ma T, and Zhang H. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Conference on Learning Theory (COLT), 2018.
  28. Li Z, Luo Y, and Lyu K. Towards resolving the implicit bias of gradient descent for matrix factorization: Greedy low-rank learning. In International Conference on Learning Representations, 2021.
  29. McMahan B, Moore E, Ramage D, Hampson S, and Agüera y Arcas B. Communication-efficient learning of deep networks from decentralized data. In Singh A and Zhu J, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 1273–1282. PMLR, 2017.
  30. Minsker S. Sub-Gaussian estimators of the mean of a random matrix with heavy-tailed entries. Annals of Statistics, 46(6A):2871–2903, 2018.
  31. Minsker S and Wei X. Robust modifications of U-statistics and applications to covariance estimation problems. Bernoulli, 26(1):694–727, 2020.
  32. Na S, Yang Z, Wang Z, and Kolar M. High-dimensional varying index coefficient models via Stein’s identity. Journal of Machine Learning Research, 20(152):1–44, 2019.
  33. Neykov M, Liu JS, and Cai T. L1-regularized least squares for support recovery of high dimensional single index models with Gaussian designs. Journal of Machine Learning Research, 17(1):2976–3012, 2016.
  34. Neyshabur B, Tomioka R, and Srebro N. In search of the real inductive bias: On the role of implicit regularization in deep learning. In International Conference on Learning Representations, 2015.
  35. Neyshabur B, Tomioka R, Salakhutdinov R, and Srebro N. Geometry of optimization and implicit regularization in deep learning. arXiv preprint arXiv:1705.03071, 2017.
  36. Otter DW, Medina JR, and Kalita JK. A survey of the usages of deep learning in natural language processing. IEEE Transactions on Neural Networks and Learning Systems, pages 1–21, 2020.
  37. Plan Y and Vershynin R. The generalized Lasso with non-linear observations. IEEE Transactions on Information Theory, 62(3):1528–1537, 2016.
  38. Plan Y, Vershynin R, and Yudovina E. High-dimensional estimation with geometric constraints. Information and Inference: A Journal of the IMA, 6(1):1–40, 2017.
  39. Poggio T, Kawaguchi K, Liao Q, Miranda B, Rosasco L, Boix X, Hidary J, and Mhaskar H. Theory of deep learning III: Explaining the non-overfitting puzzle. arXiv preprint arXiv:1801.00173, 2017.
  40. Raskutti G, Wainwright MJ, and Yu B. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.
  41. Richards D and Rebeschini P. Graph-dependent implicit regularisation for distributed stochastic subgradient descent. Journal of Machine Learning Research, 21(34):1–44, 2020.
  42. Richards D, Rebeschini P, and Rosasco L. Decentralised learning with distributed gradient descent and random features. 2020.
  43. Rohde A and Tsybakov AB. Estimation of high-dimensional low-rank matrices. Annals of Statistics, 39(2):887–930, 2011.
  44. Shechtman Y, Eldar YC, Cohen O, Chapman HN, Miao J, and Segev M. Phase retrieval with application to optical imaging: A contemporary overview. IEEE Signal Processing Magazine, 32(3):87–109, 2015.
  45. Stein C, Diaconis P, Holmes S, and Reinert G. Use of exchangeable pairs in the analysis of simulations. In Stein’s Method, pages 1–25. Institute of Mathematical Statistics, 2004.
  46. Stein C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, volume 2. The Regents of the University of California, 1972.
  47. Swirszcz G, Czarnecki WM, and Pascanu R. Local minima in training of neural networks. In International Conference on Learning Representations, 2016.
  48. Torfi A, Shirvani RA, Keneshloo Y, Tavvaf N, and Fox EA. Natural language processing advancements by deep learning: A survey. arXiv preprint arXiv:2003.01200, 2020.
  49. Tsybakov AB. Introduction to Nonparametric Estimation. Springer, 2008. ISBN 0387790519.
  50. Vaškevičius T, Kanade V, and Rebeschini P. Implicit regularization for optimal sparse recovery. In Advances in Neural Information Processing Systems, pages 2972–2983, 2019.
  51. Voulodimos A, Doulamis N, Doulamis A, and Protopapadakis E. Deep learning for computer vision: A brief review. Computational Intelligence and Neuroscience, 2018, 2018.
  52. Wei X. Structured recovery with heavy-tailed measurements: A thresholding procedure and optimal rates. arXiv preprint arXiv:1804.05959, 2018.
  53. Wei X and Minsker S. Estimation of the covariance structure of heavy-tailed distributions. In Advances in Neural Information Processing Systems, pages 2859–2868, 2017.
  54. Wilson AC, Roelofs R, Stern M, Srebro N, and Recht B. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4148–4158, 2017.
  55. Yang Z, Balasubramanian K, and Liu H. High-dimensional non-Gaussian single index models via thresholded score function estimation. In International Conference on Machine Learning, pages 3851–3860. JMLR.org, 2017.
  56. Yun C, Sra S, and Jadbabaie A. Small nonlinearities in activation functions create bad local minima in neural networks. In International Conference on Learning Representations, 2019.
  57. Zhang C, Bengio S, Hardt M, Recht B, and Vinyals O. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
  58. Zhang C-H. Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38(2):894–942, 2010.
  59. Zhao P, Yang Y, and He Q-C. Implicit regularization via Hadamard product over-parametrization in high-dimensional linear regression. arXiv preprint arXiv:1903.09367, 2019.
  60. Zhu Z. Taming the heavy-tailed features by shrinkage and clipping. arXiv preprint arXiv:1710.09020, 2017.
