Author manuscript; available in PMC: 2015 Jan 8.
Published in final edited form as: Ann Stat. 2014 May 20;42(3):872–917. doi: 10.1214/13-AOS1202

Endogeneity in High Dimensions

Jianqing Fan 1, Yuan Liao 2
PMCID: PMC4286899  NIHMSID: NIHMS649193  PMID: 25580040

Abstract

Most papers on high-dimensional statistics are based on the assumption that none of the regressors are correlated with the regression error, namely, they are exogenous. Yet, endogeneity can arise incidentally from a large pool of regressors in a high-dimensional regression. This causes the inconsistency of the penalized least-squares method and possible false scientific discoveries. A necessary condition for model selection consistency of a general class of penalized regression methods is given, which allows us to prove formally the inconsistency claim. To cope with the incidental endogeneity, we construct a novel penalized focused generalized method of moments (FGMM) criterion function. The FGMM effectively achieves dimension reduction and applies the instrumental variable method. We show that it possesses the oracle property even in the presence of endogenous predictors, and that the solution is also a near-global minimizer under the over-identification assumption. Finally, we also show how the semi-parametric efficiency of estimation can be achieved via a two-step approach.

Keywords: Focused GMM, Sparsity recovery, Endogenous variables, Oracle property, Conditional moment restriction, Estimating equation, Over identification, Global minimization, Semi-parametric efficiency

1. Introduction

In high-dimensional models, the overall number of regressors p grows extremely fast with the sample size n; it can be of order exp(n^α) for some α ∈ (0, 1). What make statistical inference possible are the sparsity and exogeneity assumptions. For example, in the linear model

Y = X^T β_0 + ε, (1.1)

it is assumed that the number of elements in S = {j : β_0j ≠ 0} is small and E(εX) = 0, or more stringently

E(ε | X) = E(Y − X^T β_0 | X) = 0. (1.2)

The latter is called “exogeneity”. One of the important objectives of high-dimensional modeling is to achieve the variable selection consistency and make inference on the coefficients of important regressors. See, for example, Fan and Li (2001), Hunter and Li (2005), Zou (2006), Zhao and Yu (2006), Huang, Horowitz and Ma (2008), Zhang and Huang (2008), Wasserman and Roeder (2009), Lv and Fan (2009), Zou and Zhang (2009), Städler, Bühlmann and van de Geer (2010), and Bühlmann, Kalisch and Maathuis (2010). In these papers, (1.2) (or E(εX) = 0) has been assumed either explicitly or implicitly.¹ Conditions of this kind are also required by the Dantzig selector of Candès and Tao (2007), which solves an optimization problem under the constraint max_{j≤p} | n^{-1} Σ_{i=1}^n X_ij (Y_i − X_i^T β) | < C √((log p)/n) for some C > 0.

In high-dimensional models, requesting that ε and all the components of X be uncorrelated as (1.2), or even more specifically

E[(Y − X^T β_0) X_j] = 0, j = 1, …, p, (1.3)

can be restrictive, particularly when p is large. Yet, (1.3) is a necessary condition for popular model selection techniques to be consistent. However, violations of either assumption (1.2) or (1.3) can arise as a result of selection biases, measurement errors, autoregression with autocorrelated errors, omitted variables, and from many other sources (Engle, Hendry and Richard 1983). They also arise from unknown causes due to a large pool of regressors, some of which are incidentally correlated with the random noise Y − X^T β_0. For example, in genomics studies, clinical or biological outcomes along with expressions of tens of thousands of genes are frequently collected. After applying variable selection techniques, scientists obtain a set of genes Ŝ that are responsible for the outcome. Whether (1.3) holds, however, is rarely validated. Because there are tens of thousands of restrictions in (1.3) to validate, it is likely that some of them are violated. Indeed, unlike in low-dimensional least squares, the sample correlations between the residuals ∊̂, based on the selected variables X_Ŝ, and the predictors X are unlikely to all be small, because the variables in the large set Ŝ^c are not even used in computing the residuals. When some of those correlations are unusually large, endogeneity arises incidentally. In such cases, we will show that Ŝ can be inconsistent. In other words, violation of assumption (1.3) can lead to false scientific claims.

We aim to consistently estimate β_0 and recover its sparsity under weaker conditions than (1.2) or (1.3) that are easier to validate. Let us assume that β_0 = (β_{0S}^T, 0)^T and that X can be partitioned as X = (X_S^T, X_N^T)^T. Here X_S corresponds to the nonzero coefficients β_{0S}, which we call important regressors, and X_N represents the unimportant regressors throughout the paper, whose coefficients are zero. We borrow the terminology of endogeneity from the econometric literature. A regressor is said to be endogenous when it is correlated with the error term, and exogenous otherwise. Motivated by the aforementioned issue, this paper aims to select X_S with probability approaching one and to make inference about β_{0S}, allowing components of X to be endogenous. We propose a unified procedure that can address the problem of endogeneity being present in either the important or the unimportant regressors, or both, and we do not require knowledge of which case of endogeneity is present in the true model. The identities of X_S are unknown before the selection.

The main assumption we make is that there is a vector of observable instrumental variables W such that

E[ε|W]=0. (1.4)


Briefly speaking, W is called an “instrumental variable” when it satisfies (1.4) and is correlated with the explanatory variable X. In particular, as noted in the footnote, W = X_S is allowed, so that the instruments are unknown but no additional data are needed. Instrumental variables (IV) have been commonly used in both the econometric and statistical literatures in the presence of endogenous regressors, to achieve identification and consistent estimation (e.g., Hall and Horowitz 2005). An advantage of such an assumption is that it can be validated more easily. For example, when W = X_S, one needs only to check whether the correlations between ∊̂ and X_Ŝ are small, with X_Ŝ being a relatively low-dimensional vector, or, more generally, whether the moments that are actually used in the model fitting, such as (1.5) below, hold approximately. In short, our setup weakens the assumption (1.2) to some verifiable moment conditions.

What makes the variable selection consistency (with endogeneity) possible is the idea of over identification. Briefly speaking, a parameter is called “over-identified” if there are more restrictions than are needed to guarantee its identifiability (for linear models, for instance, when the parameter satisfies more equations than its dimension). Let (f_1, …, f_p) and (h_1, …, h_p) be two different sets of transformations, which can be taken as a large number of series terms, e.g., B-splines and polynomials. Here each f_j and h_j is a scalar function. Then (1.4) implies

E(ε f_j(W)) = E(ε h_j(W)) = 0, j = 1, …, p.

Write F = (f_1(W), …, f_p(W))^T and H = (h_1(W), …, h_p(W))^T. We then have E(εF) = E(εH) = 0. Let S be the set of indices of important variables, and let F_S and H_S be the subvectors of F and H corresponding to the indices in S. Implied by E(εF) = E(εH) = 0 and ε = Y − X_S^T β_{0S}, there exists a solution β_S = β_{0S} to the over-identified equations (with respect to β_S):

E[(Y − X_S^T β_S) F_S] = 0 and E[(Y − X_S^T β_S) H_S] = 0. (1.5)

In (1.5), we have twice as many linear equations as unknowns, yet a solution exists and is given by β_S = β_{0S}. Because β_{0S} satisfies more equations than its dimension, we say that β_{0S} is over-identified. On the other hand, for any other set S̃ of variables with S ⊄ S̃, the following 2|S̃| equations (with |S̃| = dim(β_{S̃}) unknowns)

E[(Y − X_{S̃}^T β_{S̃}) F_{S̃}] = 0 and E[(Y − X_{S̃}^T β_{S̃}) H_{S̃}] = 0 (1.6)

have no solution as long as the basis functions are chosen such that F ≠ H.³ The above setup includes W = X_S with F = X and H = X² as a specific example (or H = cos(X) + 1 if X contains many binary variables).

We show that in the presence of endogenous regressors, the classical penalized least squares method is no longer consistent. Under model

Y = X_S^T β_{0S} + ε,  E(ε | W) = 0,

we introduce a novel penalized method, called the focused generalized method of moments (FGMM), which differs from the classical GMM (Hansen 1982) in that the working instrument V(β) in the moment functions n^{-1} Σ_{i=1}^n (Y_i − X_i^T β) V(β) for the FGMM depends irregularly on the unknown parameter β (it also depends on (F, H); see Section 3 for details). With the help of over identification, the FGMM successfully eliminates those subsets S̃ such that S ⊄ S̃. As we will see in Section 3, a penalization is still needed to avoid over-fitting. This results in a novel penalized FGMM.

We would like to comment that the FGMM differs from the low-dimensional techniques of either moment selection (Andrews 1999, Andrews and Lu 2001) or shrinkage GMM (Liao 2013) in dealing with mis-specification of moment conditions and dimension reduction. The existing methods in the literature on GMM moment selection cannot handle high-dimensional models. Recent literature on the instrumental variable method for high-dimensional models includes, e.g., Belloni et al. (2012), Caner and Fan (2012), and García (2011). In these papers, the endogenous variables are low-dimensional. More closely related work is by Gautier and Tsybakov (2011), who solved a constrained minimization as an extension of the Dantzig selector. Our paper, in contrast, achieves the oracle property via a penalized GMM. Also, we study a more general conditional moment restricted model that allows nonlinear models.

The remainder of this paper is organized as follows: Section 2 gives a necessary condition for a general penalized regression to achieve the oracle property. We also show that in the presence of endogenous regressors, the penalized least squares method is inconsistent. Section 3 constructs a penalized FGMM and discusses the rationale of our construction. Section 4 shows the oracle property of the FGMM. Section 5 discusses the global optimization. Section 6 focuses on the semi-parametric efficient estimation after variable selection. Section 7 discusses numerical implementations. We present simulation results in Section 8. Finally, Section 9 concludes. Proofs are given in the appendix.

Notation

Throughout the paper, let λ_min(A) and λ_max(A) be the smallest and largest eigenvalues of a square matrix A. We denote by ‖A‖_F, ‖A‖ and ‖A‖_∞ the Frobenius, operator and element-wise norms of a matrix A, defined respectively as ‖A‖_F = tr^{1/2}(A^T A), ‖A‖ = λ_max^{1/2}(A^T A), and ‖A‖_∞ = max_{i,j} |A_ij|. For two sequences a_n and b_n, write a_n ≪ b_n (equivalently, b_n ≫ a_n) if a_n = o(b_n). Moreover, |β|_0 denotes the number of nonzero components of a vector β. Finally, P_n′(t) and P_n″(t) denote the first and second derivatives of a penalty function P_n(t), if they exist.

2. Necessary Condition for Variable Selection Consistency

2.1. Penalized regression and necessary condition

Let s denote the dimension of the true vector of nonzero coefficients β_{0S}. The sparse structure assumes that s is small compared to the sample size. A penalized regression problem, in general, takes the form:

min_{β∈ℝ^p} L_n(β) + Σ_{j=1}^p P_n(|β_j|),

where P_n(·) denotes a penalty function. Relatively little attention has been paid to the necessary conditions for the penalized estimator to achieve the oracle property. Zhao and Yu (2006) derived an almost necessary condition for the sign consistency, which is similar to that of Zou (2006) for the least squares loss with the Lasso penalty. To the authors' best knowledge, so far there has been no necessary condition on the loss function for the selection consistency in the high-dimensional framework. Such a necessary condition is important, because it provides us with a way to judge whether a specific loss function can result in consistent variable selection.

Theorem 2.1 (Necessary Condition)

Suppose:

  1. Ln(β) is twice differentiable, and

    max_{1≤l,j≤p} |∂²L_n(β_0)/∂β_l ∂β_j| = O_p(1).
  2. There is a local minimizer β̂ = (β̂S, β̂N)T of

    L_n(β) + Σ_{j=1}^p P_n(|β_j|)

    such that P(β̂_N = 0) → 1 and √s ‖β̂ − β_0‖ = o_p(1).

  3. The penalty satisfies: P_n(·) ≥ 0, P_n(0) = 0, P_n′(t) is non-increasing on (0, u) for some u > 0, and lim_{n→∞} lim_{t→0+} P_n′(t) = 0. Then for any l ≤ p,

|∂L_n(β_0)/∂β_l| →_p 0. (2.1)

The implication (2.1) is fundamentally different from the “irrepresentable condition” in Zhao and Yu (2006) and that of Zou (2006). It imposes a restriction on the loss function L_n(·), whereas the “irrepresentable condition” is derived under the least squares loss and E(εX) = 0. For the least squares loss, (2.1) reduces to n^{-1} Σ_{i=1}^n ε_i X_il = o_p(1), or E(εX_l) = 0, which requires an exogenous relationship between ε and X. In contrast, the irrepresentable condition requires a type of relationship between important and unimportant regressors and is specific to the Lasso. It also differs from the Karush-Kuhn-Tucker (KKT) condition (e.g., Fan and Lv 2011) in that it is about the gradient vector evaluated at the true parameters rather than at the local minimizer.

The conditions on the penalty function in condition (iii) are very general, and are satisfied by a large class of popular penalties, such as Lasso (Tibshirani 1996), SCAD (Fan and Li 2001) and MCP (Zhang 2010), as long as their tuning parameter λ_n → 0. Hence this theorem should be understood as a necessary condition imposed on the loss function instead of the penalty.

2.2. Inconsistency of least squares with endogeneity

As an application of Theorem 2.1, consider a linear model:

Y = X^T β_0 + ε = X_S^T β_{0S} + ε, (2.2)

where we may not have E(εX) = 0.

The conventional penalized least squares (PLS) problem is defined as:

min_β n^{-1} Σ_{i=1}^n (Y_i − X_i^T β)² + Σ_{j=1}^p P_n(|β_j|).

In the simpler case when s, the number of nonzero components of β_0, is bounded, it can be shown that if some regressors are correlated with the regression error ε, PLS does not achieve the variable selection consistency. This is because (2.1) does not hold for the least squares loss function. Hence, without the possibly ad hoc exogeneity assumption, PLS no longer works, as more formally stated below.

Theorem 2.2 (Inconsistency of PLS)

Suppose the data are i.i.d., s = O(1), and X has at least one endogenous component, that is, there is l such that |E(X_l ε)| > c for some c > 0. Assume that E X_l^4 < ∞, E ε^4 < ∞, and P_n(t) satisfies the conditions in Theorem 2.1. If β̃ = (β̃_S^T, β̃_N^T)^T, corresponding to the coefficients of (X_S, X_N), is a local minimizer of

n^{-1} Σ_{i=1}^n (Y_i − X_i^T β)² + Σ_{j=1}^p P_n(|β_j|),

then either ‖β̃_S − β_{0S}‖ does not converge in probability to zero, or lim sup_{n→∞} P(β̃_N = 0) < 1.

The index l in the condition of the above theorem does not have to be the index of an important regressor. Hence the consistency of penalized least squares will fail even if the endogeneity is only present in the unimportant regressors.

We conduct a simple simulated experiment to illustrate the impact of endogeneity on the variable selection. Consider

Y = X^T β_0 + ε, ε ~ N(0, 1), β_{0S} = (5, 4, 7, 2, 1.5); β_{0j} = 0 for 6 ≤ j ≤ p.
X_j = Z_j for j ≤ 5; X_j = (Z_j + 5)(1 + ε) for 6 ≤ j ≤ p.
Z ~ N_p(0, Σ), independent of ε, with (Σ)_ij = 0.5^{|i−j|}.
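For concreteness, here is a minimal sketch of this data-generating process in Python; the (1 + ε) factor in the endogenous regressors is our reading of the design above, and the function name is illustrative rather than the authors' code.

```python
import numpy as np

def simulate_design_2_2(n=200, p=50, seed=0):
    """One dataset from the design above: important regressors exogenous,
    unimportant regressors endogenous through the (1 + eps) factor."""
    rng = np.random.default_rng(seed)
    Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    eps = rng.normal(size=n)
    X = Z.copy()
    X[:, 5:] = (Z[:, 5:] + 5) * (1 + eps[:, None])   # endogenous unimportant regressors
    beta0 = np.zeros(p)
    beta0[:5] = [5, 4, 7, 2, 1.5]
    Y = X @ beta0 + eps
    return Y, X, beta0
```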

In the design, the unimportant regressors are endogenous. The penalized least squares (PLS) with the SCAD penalty was used for variable selection. The λ's in the table represent the tuning parameter used in the SCAD penalty. The results are based on the estimated (β̂_S^T, β̂_N^T)^T, obtained from minimizing the PLS and FGMM loss functions respectively (we shall discuss the construction of the FGMM loss function and its numerical minimization in detail subsequently). Here β̂_S and β̂_N represent the estimators of the coefficients of the important and unimportant regressors respectively.

From Table 1, PLS selects many unimportant regressors (FP). In contrast, the penalized FGMM performs well both in selecting the important regressors and in eliminating the unimportant ones. The larger MSES of β̂_S by FGMM is due to the moment conditions used in the estimation; this can be improved further as in Section 6. Also, when endogeneity is present in the important regressors, the PLS estimator has a larger bias (see the additional simulation results in Section 8).

Table 1.

Performance of PLS and FGMM over 100 replications. p = 50, n = 200

PLS FGMM
λ = 0.05 λ = 0.1 λ = 0.5 λ = 1 λ = 0.05 λ = 0.1 λ = 0.2
MSES 0.145 (0.053) 0.133 (0.043) 0.629 (0.301) 1.417 (0.329) 0.261 (0.094) 0.184 (0.069) 0.194 (0.076)
MSEN 0.126 (0.035) 0.068 (0.016) 0.072 (0.016) 0.095 (0.019) 0.001 (0.010) 0 (0) 0.001 (0.009)
TP 5 (0) 5 (0) 4.82 (0.385) 3.63 (0.504) 5 (0) 5 (0) 5 (0)
FP 37.68 (2.902) 35.36 (3.045) 8.84 (3.334) 2.58 (1.557) 0.08 (0.337) 0 (0) 0.02 (0.141)

MSES is the average of ‖β̂_S − β_{0S}‖ for the nonzero coefficients. MSEN is the average of ‖β̂_N − β_{0N}‖ for the zero coefficients. TP is the number of correctly selected variables, and FP is the number of incorrectly selected variables. The standard error of each measure is also reported.

3. Focused GMM

3.1. Definition

Because of the presence of endogenous regressors, we introduce an instrumental variable (IV) regression model. Consider a more general nonlinear model:

E[g(Y, X_S^T β_{0S}) | W] = 0, (3.1)

where Y stands for the dependent variable and g : ℝ × ℝ → ℝ is a known function. For simplicity, we require g to be one-dimensional; it should be thought of as a possibly nonlinear residual function. Our result can be naturally extended to a multi-dimensional g. Here W is a vector of observed random variables, known as instrumental variables.

Model (3.1) is called a conditional moment restricted model, which has been extensively studied in the literature, e.g., Newey (1993), Donald et al. (2009), Kitamura et al (2004). The high-dimensional model is also closely related to the semi/nonparametric model estimated by sieves with a growing sieve dimension, e.g., Ai and Chen (2003). Recently van de Geer (2008) and Fan and Lv (2011) considered generalized linear models without endogeneity. Some interesting examples of the generalized linear model that fit into (3.1) are:

  • linear regression, g(t_1, t_2) = t_1 − t_2;

  • logit model, g(t1, t2) = t1 − exp(t2)/(1 + exp(t2));

  • probit model, g(t1, t2) = t1 − Φ(t2) where Φ(·) denotes the standard normal cumulative distribution function.

Let (f_1, …, f_p) and (h_1, …, h_p) be two different sets of transformations of W, which can be taken as a large number of series basis functions, e.g., B-splines, Fourier series, polynomials (see Chen 2007 for discussions of the choice of sieve functions). Here each f_j and h_j is a scalar function. Write F = (f_1(W), …, f_p(W))^T and H = (h_1(W), …, h_p(W))^T. The conditional moment restriction (3.1) then implies that

E[g(Y, X_S^T β_{0S}) F_S] = 0, and E[g(Y, X_S^T β_{0S}) H_S] = 0, (3.2)

where F_S and H_S are the subvectors of F and H whose supports are on the oracle set S = {j ≤ p : β_0j ≠ 0}. In particular, when all the components of X_S are known to be exogenous, we can take F = X and H = X² (the vector of squares of X taken coordinate-wise), or H = cos(X) + 1 if X is a binary variable. A typical estimator based on moment conditions like (3.2) can be obtained via the generalized method of moments (GMM, Hansen 1982). However, in the problem considered here, (3.2) cannot be used directly to construct the GMM criterion function, because the identities of X_S are unknown.

Remark 3.1

One seemingly workable solution is to define V as a vector of transformations of W, for instance V = F, and apply GMM to the moment condition E[g(Y, X^T β_0)V] = 0. However, one has to take dim(V) ≥ dim(β) = p to guarantee that the GMM criterion function has a unique minimizer (in the linear model, for instance). Since p ≫ n, the dimension of V is too large, and the sample analogue of the GMM criterion function may not converge to its population version due to the accumulation of high-dimensional estimation errors.

Let us introduce some additional notation. For any β ∈ ℝ^p \ {0} and i = 1, …, n, define the r = |β|_0-dimensional vectors

F_i(β) = (f_{l_1}(W_i), …, f_{l_r}(W_i))^T and H_i(β) = (h_{l_1}(W_i), …, h_{l_r}(W_i))^T, where (l_1, …, l_r) are the indices of the nonzero components of β. For example, if p = 3 and β = (−1, 0, 2)^T, then F_i(β) = (f_1(W_i), f_3(W_i))^T and H_i(β) = (h_1(W_i), h_3(W_i))^T, i ≤ n.

Our focused GMM (FGMM) loss function is defined as

L_FGMM(β) = Σ_{j=1}^p I(β_j ≠ 0) { w_{j1} [ n^{-1} Σ_{i=1}^n g(Y_i, X_i^T β) f_j(W_i) ]² + w_{j2} [ n^{-1} Σ_{i=1}^n g(Y_i, X_i^T β) h_j(W_i) ]² }, (3.3)

where w_{j1} and w_{j2} are given weights. For example, we will take w_{j1} = 1/var̂(f_j(W)) and w_{j2} = 1/var̂(h_j(W)) to standardize the scale (here var̂ represents the sample variance). Writing it in matrix form, for V_i(β) = (F_i(β)^T, H_i(β)^T)^T,

L_FGMM(β) = [ n^{-1} Σ_{i=1}^n g(Y_i, X_i^T β) V_i(β) ]^T J(β) [ n^{-1} Σ_{i=1}^n g(Y_i, X_i^T β) V_i(β) ],

where J(β) = diag{w_{l_1 1}, …, w_{l_r 1}, w_{l_1 2}, …, w_{l_r 2}}.⁴

Unlike the traditional GMM, the “working instrumental variables” V(β) depend irregularly on the unknown β. As will be further explained, this ensures dimension reduction and allows us to focus only on the equations whose instruments are supported on the oracle space; the method is therefore called the focused GMM, or FGMM for short.
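To make the construction concrete, here is a minimal sketch of the loss (3.3) in Python for a generic residual function g; the default g(y, t) = y − t corresponds to the linear model, and all names are illustrative rather than the authors' implementation.

```python
import numpy as np

def fgmm_loss(beta, Y, X, F, H, g=lambda y, t: y - t, tol=1e-10):
    """FGMM loss (3.3): only the moment conditions indexed by the support of
    beta are used, each standardized by the sample variance of its instrument."""
    resid = g(Y, X @ beta)                      # g(Y_i, X_i^T beta), length n
    support = np.flatnonzero(np.abs(beta) > tol)
    loss = 0.0
    for j in support:
        w_f = 1.0 / np.var(F[:, j])             # weight w_{j1}
        w_h = 1.0 / np.var(H[:, j])             # weight w_{j2}
        loss += w_f * np.mean(resid * F[:, j]) ** 2
        loss += w_h * np.mean(resid * H[:, j]) ** 2
    return loss
```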

We then define the FGMM estimator by minimizing the following criterion function:

Q_FGMM(β) = L_FGMM(β) + Σ_{j=1}^p P_n(|β_j|). (3.4)

Sufficient conditions on the penalty function P_n(|β_j|) for the oracle property will be presented in Section 4. Penalization is needed because otherwise small coefficients in front of unimportant variables would still be kept when minimizing L_FGMM(β). As will become clearer in Section 6, the FGMM focuses on model selection and estimation consistency without paying much attention to the efficient estimation of β_{0S}.

3.2. Rationales behind the construction of FGMM

3.2.1. Inclusion of V(β)

We construct the FGMM criterion function using

V(β) = (F(β)^T, H(β)^T)^T.

A natural question arises: why not just use one set of IVs so that V(β) = F(β)? We now explain the rationale behind the inclusion of the second set of instruments H(β). To simplify notation, let F_ij = f_j(W_i) and H_ij = h_j(W_i) for j ≤ p and i ≤ n. Then F_i = (F_i1, …, F_ip) and H_i = (H_i1, …, H_ip). Also write F_j = f_j(W) and H_j = h_j(W) for j ≤ p.

Let us consider a linear regression model (2.2) as an example. If H(β) were not included and V(β) = F(β) had been used, the GMM loss function would have been constructed as

L_V(β) = ‖ n^{-1} Σ_{i=1}^n (Y_i − X_i^T β) F_i(β) ‖², (3.5)

where, for the simplicity of illustration, J(β) is taken as an identity matrix. We also use the L_0-penalty P_n(|β_j|) = λ_n I(β_j ≠ 0) for illustration. Suppose that the true β_0 = (β_{0S}^T, 0, …, 0)^T, where only the first s components are nonzero, and that s > 1. If we, however, restrict ourselves to β^p = (0, …, 0, β_p), the criterion function now becomes

Q_FGMM(β^p) = [ n^{-1} Σ_{i=1}^n (Y_i − X_ip β_p) F_ip ]² + λ_n.

It is easy to see that its minimum is just λ_n. On the other hand, if we optimize Q_FGMM on the oracle space β = (β_S^T, 0)^T, then

min_{β=(β_S^T,0)^T, β_{S,j}≠0} Q_FGMM(β) ≥ sλ_n.

As a result, minimizing (3.5) plus the penalty is inconsistent for variable selection: the restricted single-regressor model attains a strictly smaller criterion value (λ_n versus at least sλ_n).

The use of the L_0-penalty is not essential in the above illustration. The problem is still present if the L_1-penalty is used, and is not merely due to the biasedness of the L_1-penalty. For instance, recall that for the SCAD penalty with hyper-parameter (a, λ_n), P_n(·) is non-decreasing, and P_n(t) = (a+1)λ_n²/2 when t ≥ aλ_n. Given that min_{j∈S} |β_0j| ≫ λ_n,

Q_FGMM(β_0) ≥ Σ_{j∈S} P_n(|β_0j|) ≥ s P_n(min_{j∈S} |β_0j|) = (a+1)λ_n² s/2.

On the other hand, Q_FGMM(β^p) ≈ P_n(|β_p|) ≤ (a+1)λ_n²/2, which is strictly less than Q_FGMM(β_0) when s > 1. So the problem is still present when an asymptotically unbiased penalty (e.g., SCAD, MCP) is used.

Including an additional term H(β) in V(β) can overcome this problem. For example, if we still restrict to βp = (0,…, βp) but include an additional but different IV Hip, the criterion function then becomes, for the L0 penalty:

Q_FGMM(β^p) = [ n^{-1} Σ_{i=1}^n (Y_i − X_ip β_p) F_ip ]² + [ n^{-1} Σ_{i=1}^n (Y_i − X_ip β_p) H_ip ]² + λ_n.

In general, the first two terms cannot be o_p(1) simultaneously as long as the two sets of transformations {f_j(·)} and {h_j(·)} are fixed differently, n is large and

(E X_p F_p)^{-1} E(Y F_p) ≠ (E X_p H_p)^{-1} E(Y H_p). (3.6)

As a result, QFGMM(βp) is bounded away from zero with probability approaching one.
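The following toy computation (our own illustrative design, not one from the paper) makes this point numerically: solving the single F-moment equation exactly for β_p leaves the H-moment bounded away from zero, so the sample analogue of (3.6) holds and the restricted criterion stays bounded away from zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 3
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
W = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
F, H = np.sin(W), np.cos(W)                  # two different instrument transformations
X = F + H + rng.normal(size=(n, p))          # exogenous regressors built from W
beta0 = np.array([2.0, -1.5, 0.0])           # the last regressor X_p is unimportant
Y = X @ beta0 + rng.normal(size=n)

# Restrict to beta^p = (0, ..., 0, beta_p) and solve the single F-moment exactly.
Fp, Hp, Xp = F[:, -1], H[:, -1], X[:, -1]
beta_p = np.mean(Y * Fp) / np.mean(Xp * Fp)
m_F = np.mean((Y - Xp * beta_p) * Fp)        # zero by construction
m_H = np.mean((Y - Xp * beta_p) * Hp)        # stays away from zero
print(round(m_F, 8), round(m_H, 3))
```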

To better understand the behavior of QFGMM (β), it is more convenient to look at the population analogues of the loss function. Because the number of equations in

E[(Y − X^T β) F(β)] = 0 and E[(Y − X^T β) H(β)] = 0 (3.7)

is twice the number of unknowns (nonzero components of β), if we denote by S̃ the support of β, then (3.7) has a solution only when (E F_{S̃} X_{S̃}^T)^{-1} E(Y F_{S̃}) = (E H_{S̃} X_{S̃}^T)^{-1} E(Y H_{S̃}), which does not hold in general unless S̃ = S, the index set of the true nonzero coefficients. Hence it is natural for (3.7) to have a unique solution β = β_0. As a result, if we define

G(β) = ‖E[(Y − X^T β) F(β)]‖² + ‖E[(Y − X^T β) H(β)]‖²,

the population version of LFGMM, then as long as β is not close to β0, G should be bounded away from zero. Therefore, it is reasonable for us to assume that for any δ > 0, there is γ(δ) > 0 such that

inf_{‖β−β_0‖>δ, β≠0} G(β) > γ(δ). (3.8)

On the other hand, E(ε | W) = E(Y − X_S^T β_{0S} | W) = 0 implies G(β_0) = 0.

Our FGMM loss function is essentially a sample version of G(β), so minimizing LFGMM(β) forces the estimator to be close to β0, but small coefficients in front of unimportant but exogenous regressors may still be allowed. Hence a concave penalty function is added to LFGMM to define QFGMM.

3.2.2. Indicator function

Another question readers may ask is why we do not define L_FGMM(β) to be, for some weight matrix J,

[ n^{-1} Σ_{i=1}^n g(Y_i, X_i^T β) V_i ]^T J [ n^{-1} Σ_{i=1}^n g(Y_i, X_i^T β) V_i ], (3.9)

that is, why not replace the irregular β-dependent V(β) with V and use the entire 2p-dimensional V = (F^T, H^T)^T as the IV? This is equivalent to asking why the indicator function in (3.3) cannot be dropped.

The indicator function is used to prevent the accumulation of estimation errors under the high dimensionality. To see this, rewrite (3.9) to be:

Σ_{j=1}^p [ 1/var̂(F_j) ] ( n^{-1} Σ_{i=1}^n g(Y_i, X_i^T β) F_ij )² + [ 1/var̂(H_j) ] ( n^{-1} Σ_{i=1}^n g(Y_i, X_i^T β) H_ij )².

Since dim(V_i) = 2p ≫ n, even though each individual term evaluated at β = β_0 is O_p(1/n), the sum of p such terms can become stochastically unbounded. In general, (3.9) does not converge to its population analogue when p ≫ n because the accumulation of high-dimensional estimation errors has a non-negligible effect.

In contrast, the indicator function effectively reduces the dimension and prevents the accumulation of estimation errors. Once the indicator function is included, the proposed FGMM loss function evaluated at β0 becomes:

Σ_{j∈S} [ 1/var̂(F_j) ] ( n^{-1} Σ_{i=1}^n g(Y_i, X_i^T β_0) F_ij )² + [ 1/var̂(H_j) ] ( n^{-1} Σ_{i=1}^n g(Y_i, X_i^T β_0) H_ij )²,

which is small because E[g(Y, X^T β_0) F_S] = E[g(Y, X^T β_0) H_S] = 0 and there are only s = |S| terms in the summation.

Recently, there has been growing literature on the shrinkage GMM, e.g., Caner (2009), Caner and Zhang (2013), Liao (2013), etc, regarding estimation and variable selection based on a set of moment conditions like (3.2). The model considered by these authors is restricted to either a low-dimensional parameter space or a low-dimensional vector of moment conditions, where there is no such a problem of error accumulations.

4. Oracle Property of FGMM

FGMM involves a non-smooth loss function. In the appendix, we develop a general asymptotic theory for high-dimensional models to accommodate the non-smooth loss function.

Our first assumption defines the penalty function we use. Consider a similar class of folded concave penalty functions as that in Fan and Li (2001).

For any β = (β_1, …, β_s)^T ∈ ℝ^s with |β_j| ≠ 0, j = 1, …, s, define

η(β) = lim sup_{ε→0+} max_{j≤s} sup_{t_1<t_2, (t_1,t_2)⊂(|β_j|−ε, |β_j|+ε)} −[P_n′(t_2) − P_n′(t_1)]/(t_2 − t_1), (4.1)

which equals max_{j≤s} −P_n″(|β_j|) if the second derivative of P_n is continuous. Let

d_n = (1/2) min{ |β_0j| : β_0j ≠ 0, j = 1, …, p }

represent the strength of signals.

Assumption 4.1

The penalty function P_n(t) : [0, ∞) → ℝ satisfies:

  1. Pn(0) = 0

  2. P_n(t) is concave and non-decreasing on [0, ∞), and has a continuous derivative P_n′(t) when t > 0.

  3. √s P_n′(d_n) = o(d_n).

  4. There exists c > 0 such that supβ∈B(β0S,cdn) η(β) = o(1).

These conditions are standard. The concavity of P_n(·) implies that η(β) ≥ 0 for all β ∈ ℝ^s. It is straightforward to check that with properly chosen tuning parameters, the L_q penalty (for q ≤ 1), hard-thresholding (Antoniadis 1996), SCAD (Fan and Li 2001) and MCP (Zhang 2010) all satisfy these conditions. As thoroughly discussed by Fan and Li (2001), a penalty function that is desirable for achieving the oracle properties should result in an estimator with three properties: unbiasedness, sparsity and continuity (see Fan and Li 2001 for details). These properties motivate the need for a folded concave penalty.
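For reference, a minimal sketch of the SCAD penalty and its derivative (the standard formulas of Fan and Li 2001) is given below; it can be used to check Assumption 4.1 numerically for a given (a, λ_n).

```python
import numpy as np

def scad(t, lam, a=3.7):
    """SCAD penalty P_lambda(t) for t >= 0 (Fan and Li 2001)."""
    t = np.abs(t)
    return np.where(t <= lam, lam * t,
           np.where(t <= a * lam,
                    (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
                    (a + 1) * lam**2 / 2))

def scad_deriv(t, lam, a=3.7):
    """P'_lambda(t): equals lam on (0, lam], decays linearly, vanishes for t > a*lam."""
    t = np.abs(t)
    return lam * ((t <= lam) + (t > lam) * np.maximum(a * lam - t, 0.0) / ((a - 1) * lam))
```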

The following assumptions are further imposed. Recall that for jp, Fj = fj (W) and Hj = hj (W).

Assumption 4.2

  1. The true parameter β0 is uniquely identified by E(g(Y, XT β0)|W) = 0.

  2. (Y1, X1),…, (Yn, Xn) are independent and identically distributed.

Remark 4.1

Condition (i) above is standard in the GMM literature (e.g., Newey 1993, Donald et al. 2009, Kitamura et al. 2004). This condition is closely related to the “over-identifying restriction”, and ensures that we can always find two sets of transformations F and H such that the equations in (3.2) are uniquely satisfied by β_S = β_{0S}. In linear models, this is a reasonable assumption, as discussed in Section 3.2. In nonlinear models, however, requiring identifiability from either E(g(Y, X^T β_0) | W) = 0 or (3.2) may be restrictive. Indeed, Dominguez and Lobato (2004) showed that the identification condition in (i) may depend on the marginal distribution of W. Furthermore, in nonparametric regression problems, as in Bickel et al. (2009) and Ai and Chen (2003), the sufficient condition for Condition (i) is even more complicated; it also depends on the conditional distribution of X | W, and is known to be statistically untestable (see Newey and Powell 2003, Canay et al. 2013).

Assumption 4.3

There exist b1, b2, b3 > 0 and r1, r2, r3 > 0 such that for any t > 0,

  1. P(|g(Y, X^T β_0)| > t) ≤ exp(−(t/b_1)^{r_1}),

  2. max_{l≤p} P(|F_l| > t) ≤ exp(−(t/b_2)^{r_2}), and max_{l≤p} P(|H_l| > t) ≤ exp(−(t/b_3)^{r_3}).

  3. min_{j∈S} var(g(Y, X^T β_0) F_j) and min_{j∈S} var(g(Y, X^T β_0) H_j) are bounded away from zero.

  4. var(F_j) and var(H_j) are bounded away from both zero and infinity uniformly in j = 1, …, p and p ≥ 1.

We will assume g(·,·) to be twice differentiable, and in the following assumptions, let

m(t_1, t_2) = ∂g(t_1, t_2)/∂t_2,  q(t_1, t_2) = ∂²g(t_1, t_2)/∂t_2²,  V_S = (F_S^T, H_S^T)^T.

Assumption 4.4

  1. g(·,·) is twice differentiable.

  2. supt1, t2 |m(t1, t2)| < ∞, and supt1, t2 |q(t1, t2)| <∞.

It is straightforward to verify Assumption 4.4 for linear, logistic and probit regression models.

Assumption 4.5

There exist C1 > 0 and C2 > 0 such that

λ_max[ (E m(Y, X_S^T β_{0S}) X_S V_S^T)(E m(Y, X_S^T β_{0S}) X_S V_S^T)^T ] < C_1, and λ_min[ (E m(Y, X_S^T β_{0S}) X_S V_S^T)(E m(Y, X_S^T β_{0S}) X_S V_S^T)^T ] > C_2.

These conditions require that the instrument VS be not weak, that is, VS should not be weakly correlated with the important regressors. In the generalized linear model, Assumption 4.5 is satisfied if proper conditions on the design matrices are imposed. For example, in the linear regression model and probit model, we assume the eigenvalues of (EXSVST)(EXSVST)T and (Eϕ(XTβ0)XSVST)(Eϕ(XTβ0)XSVST)T are bounded away from both zero and infinity respectively, where ϕ(·) is the standard normal density function. Conditions in the same spirit are also assumed in, e.g., Bradic et al. (2011), and Fan and Lv (2011).

Define

ϒ = var( g(Y, X_S^T β_{0S}) V_S ). (4.2)

Assumption 4.6

  1. For some c > 0, λmin(ϒ) > c.

  2. √s P_n′(d_n) + √(s (log p)/n) + √(s³ (log s)/n) = o(P_n′(0+)), s² P_n′(d_n) = O(1), and √(s (log p)/n) = o(d_n).

  3. P_n′(d_n) = o(1/√(ns)) and sup_{‖β−β_{0S}‖≤d_n/4} η(β) = o((s log p)^{−1/2}).

  4. max_{j∉S} ‖E[m(Y, X^T β_0) X_j V_S]‖ √((log s)/n) = o(P_n′(0+)).

This assumption imposes a further condition jointly on the penalty, the strength of the minimal signal and the number of important regressors. Condition (i) is needed for the asymptotic normality of the estimated nonzero coefficients. When either SCAD or MCP is used as the penalty function with a tuning parameter λ_n, P_n′(d_n) = sup_{‖β−β_{0S}‖≤d_n/4} η(β) = 0 and P_n′(0+) = λ_n when λ_n = o(d_n). Thus Conditions (ii)-(iv) in the assumption are satisfied as long as √(s (log p)/n) + √(s³ (log s)/n) ≪ λ_n ≪ d_n. This requires the signal d_n to be strong and s to be small compared to n. Such a condition is needed to achieve the variable selection consistency.

Under the foregoing regularity conditions, we can show the oracle property of a local minimizer of QFGMM (3.4).

Theorem 4.1

Suppose s³ log p = o(n). Under Assumptions 4.1-4.6, there exists a local minimizer β̂ = (β̂_S^T, β̂_N^T)^T of Q_FGMM(β), with β̂_S and β̂_N being the sub-vectors of β̂ whose coordinates are in S and S^c respectively, such that:

  1. √n α^T Γ^{−1/2} Σ (β̂_S − β_{0S}) →_d N(0, 1),

    for any unit vector α ∈ ℝ^s, ‖α‖ = 1, where A = E[m(Y, X^T β_0) X_S V_S^T],

    Γ = 4 A J(β_0) ϒ J(β_0) A^T, and Σ = 2 A J(β_0) A^T.
  2. lim_{n→∞} P(β̂_N = 0) = 1.

    In addition, the local minimizer β̂ is strict with probability at least 1 – δ for an arbitrarily small δ > 0 and all large n.

  3. Let Ŝ = {j ≤ p : β̂_j ≠ 0}. Then

    P(Ŝ = S) → 1.

Remark 4.2

As was shown in an earlier version of this paper (Fan and Liao 2012), when it is known that E[g(Y, X^T β_0) | X_S] = 0 but possibly E[g(Y, X^T β_0) | X] ≠ 0, we can take V = (F^T, H^T)^T to be transformations of X that satisfy Assumptions 4.3-4.6. In this way, we do not need an extra instrumental variable W, and Theorem 4.1 still goes through, while the traditional methods (e.g., penalized least squares in the linear model) can still fail, as shown by Theorem 2.2. In the high-dimensional linear model, compared with the classical assumption E(ε | X) = 0, our condition E(ε | X_S) = 0 is relatively easier to validate since X_S is a low-dimensional vector.

Remark 4.3

We now explain the required lower bound on the signal, √(s (log p)/n) = o(d_n). When a penalized regression of the form min_β L_n(β) + Σ_{j=1}^p P_n(|β_j|) is used, it is required that, if L_n(β) is differentiable, max_{j∉S} |∂L_n(β_0)/∂β_j| = o(P_n′(0+)). This often leads to a requirement on the lower bound of d_n. Therefore, such a lower bound on d_n depends on the choice of both the loss function L_n(β) and the penalty. For instance, in the linear model, when least squares with a SCAD penalty is employed, this condition is equivalent to √((log p)/n) = o(d_n). It is also known that the adaptive lasso penalty requires the minimal signal to be significantly larger than √((log p)/n) (Huang, Ma and Zhang 2008). In our framework, the requirement √(s (log p)/n) = o(d_n) arises from the use of the new FGMM loss function. Such a condition is stronger than that of the least squares loss function, which is the price paid to achieve variable selection consistency in the presence of endogeneity. This condition is still easy to satisfy as long as s grows slowly with n.

Remark 4.4

Similar to the “irrepresentable condition” for the Lasso, the FGMM requires that the important and unimportant explanatory variables not be strongly correlated. This is fulfilled by Assumption 4.6(iv). For instance, in the linear model when V_S contains X_S, as in our earlier version, this condition implies max_{j∉S} ‖E X_j X_S‖ √((log s)/n) = o(λ_n). Strong correlation between (X_S, X_N) is also ruled out by the identifiability condition, Assumption 4.2. To illustrate the idea, consider a case of perfect linear correlation: X_S^T α − X_N^T δ = 0 for some (α, δ) with δ ≠ 0. Then X^T β_0 = X_S^T(β_{0S} − α) + X_N^T δ. As a result, the FGMM can be variable selection inconsistent because β_0 and (β_{0S} − α, δ) are observationally equivalent, violating Assumption 4.2.

5. Global minimization

With the over-identification condition, we can show that the local minimizer in Theorem 4.1 is nearly global. To this end, define an l_∞ ball centered at β_0 with radius δ:

Θ_δ = { β ∈ ℝ^p : |β_i − β_0i| < δ, i = 1, …, p }.

Assumption 5.1 (over-identification)

For any δ > 0, there is γ > 0 such that

lim_{n→∞} P( inf_{β∉Θ_δ∪{0}} ‖ n^{-1} Σ_{i=1}^n g(Y_i, X_i^T β) V_i(β) ‖² > γ ) = 1.

This high-level assumption is hard to avoid in high-dimensional problems. It is the empirical counterpart of (3.8). In classical low-dimensional regression models, this assumption has often been imposed in the econometric literature, e.g., Andrews (1999), Chernozhukov and Hong (2003), among many others. Let us illustrate it by the following example.

Example 5.1

Consider a linear regression model of low dimensions: E(Y − X_S^T β_{0S} | W) = 0, which implies E[(Y − X_S^T β_{0S}) F_S] = 0 and E[(Y − X_S^T β_{0S}) H_S] = 0, where p is either bounded or slowly diverging with n. Now consider the following problem:

min_{β≠0} G(β) ≡ min_{β≠0} ‖E[(Y − X^T β) F(β)]‖² + ‖E[(Y − X^T β) H(β)]‖².

Once [E F_{S̃} X_{S̃}^T]^{-1} E[F_{S̃} Y] ≠ [E H_{S̃} X_{S̃}^T]^{-1} E[H_{S̃} Y] for all index sets S̃ ≠ S, the objective function is minimized to zero uniquely by β = β_0. Moreover, for any δ > 0 there is γ > 0 such that when β ∉ Θ_δ ∪ {0}, we have G(β) > γ > 0. Assumption 5.1 then follows from the uniform weak law of large numbers: with probability approaching one, uniformly in β ∉ Θ_δ ∪ {0},

‖ n^{-1} Σ_{i=1}^n F_i(β)(Y_i − X_i^T β) ‖² + ‖ n^{-1} Σ_{i=1}^n H_i(β)(Y_i − X_i^T β) ‖² > γ/2.

When p is much larger than n, the accumulation of the fluctuations from using the law of large numbers is no longer negligible. It is then challenging to show that ‖E[g(Y, X^T β) V(β)]‖ is close to ‖ n^{-1} Σ_{i=1}^n g(Y_i, X_i^T β) V_i(β) ‖ uniformly over high-dimensional β's, which is why we impose Assumption 5.1 on the empirical counterpart instead of the population version.

Theorem 5.1

Assume max_{j∈S} P_n(|β_0j|) = o(s^{-1}). Under Assumption 5.1 and those of Theorem 4.1, the local minimizer β̂ in Theorem 4.1 satisfies: for any δ > 0, there exists γ > 0,

lim_{n→∞} P( Q_FGMM(β̂) + γ < inf_{β∉Θ_δ∪{0}} Q_FGMM(β) ) = 1.

The above theorem demonstrates that β̂ is a nearly global minimizer. For the SCAD and MCP penalties, the condition max_{j∈S} P_n(|β_0j|) = o(s^{-1}) holds when λ_n = o(s^{-1}), which is satisfied if s is not large.

Remark 5.1

We exclude the set {0} from the search area in both Assumption 5.1 and Theorem 5.1 because we do not include an intercept in the model, so that no moment condition is selected at β = 0 and hence Q_FGMM(0) = 0 by definition. It is reasonable to believe that zero is not close to the true parameter, since we assume there is at least one important regressor in the model. On the other hand, if we always keep X_1 = 1 to allow for an intercept, there is no need to remove {0} in either Assumption 5.1 or the above theorem. Such a small change is not essential.

Remark 5.2

Assumption 5.1 can be slightly relaxed so that γ is allowed to decay slowly at a certain rate. The lower bound of such a rate is given by Lemma D.2 in the appendix. Moreover, Theorem 5.1 is based on an over-identification assumption, which is essentially different from the global minimization theory in the recent high-dimensional literature, e.g., Zhang (2010), Bühlmann and van de Geer (2011, ch 9), and Zhang and Zhang (2012).

6. Semi-parametric efficiency

The results in Section 4 demonstrate that the choice of the basis functions {f_j, h_j}_{j≤p} forming F and H influences the asymptotic variance of the estimator. The resulting estimator is in general not efficient. To obtain a semi-parametric efficient estimator, one can employ a second-step post-FGMM procedure. In the linear regression, a similar idea has been used by Belloni and Chernozhukov (2013).

After achieving the oracle properties in Theorem 4.1, we have identified the important regressors with probability approaching one, that is,

Ŝ = { j : β̂_j ≠ 0 },  X_Ŝ = ( X_j : j ∈ Ŝ ),  P(Ŝ = S) → 1.

This reduces the problem to a low-dimensional problem. For simplicity, we restrict s = O(1). The problem of constructing semi-parametric efficient estimator (in the sense of Newey (1990) and Bickel et al. (1998)) in a low-dimensional model

E[g(Y, X_S^T β_{0S}) | W] = 0

has been well studied in the literature (see, for example, Chamberlain (1987), Newey (1993)). The optimal instrument that leads to the semi-parametric efficient estimation of β0S is given by D(W)σ(W)−2, where

D(W) = E( ∂g(Y, X_S^T β_{0S})/∂β_S | W ),  σ(W)² = E( g(Y, X_S^T β_{0S})² | W ).

Newey (1993) showed that the semi-parametric efficient estimator of β0S can be obtained by GMM with the moment condition:

E[ g(Y, X_S^T β_{0S}) σ(W)^{-2} D(W) ] = 0. (6.1)

In the post-FGMM procedure, we replace X_S with the selected X_Ŝ obtained from the first-step penalized FGMM. Suppose there exist consistent estimators D̂(W) and σ̂(W)² of D(W) and σ(W)². Let us assume the true parameter satisfies ‖β_{0S}‖ < M for a large constant M > 0. We then estimate β_{0S} by solving

ρ_n(β_S) = n^{-1} Σ_{i=1}^n g(Y_i, X̂_{iS}^T β_S) σ̂(W_i)^{-2} D̂(W_i) = 0, (6.2)

on { β_S : ‖β_S‖ ≤ M }, and the solution is assumed to be unique.

Assumption 6.1

  1. There exist C1 > 0 and C2 > 0 so that

    C_1 < inf_{w∈χ} σ(w)² ≤ sup_{w∈χ} σ(w)² < C_2.

    In addition, there exist σ̂(w)² and D̂(w) such that

    sup_{w∈χ} |σ̂(w)² − σ(w)²| = o_p(1), and sup_{w∈χ} ‖D̂(w) − D(w)‖ = o_p(1),

    where χ is the support of W.

  2. E( sup_{‖β_S‖≤M} |g(Y, X_S^T β_S)|⁴ ) < ∞.

The consistent estimators for D(w) and σ(w)2 can be obtained in many ways. We present a few examples below.

Example 6.1 (Homoskedasticity)

Suppose Y = h(X_S^T β_{0S}) + ε for some nonlinear function h(·). Then σ(w)² = E(ε² | W = w) = σ², which does not depend on w under homoskedasticity. In this case, equations (6.1) and (6.2) do not depend on σ².

Example 6.2 (Simultaneous linear equations)

In the simultaneous linear equation model, XS linearly depends on W as:

g(Y, X_S^T β_S) = Y − X_S^T β_S,  X_S = Π W + u

for some coefficient matrix Π, where u is independent of W. Then D(w) = E(X_S | W = w) = Πw. Let X̄ = (X̂_{S1}, …, X̂_{Sn}) and W̄ = (W_1, …, W_n). We then estimate D(w) by Π̂w, where Π̂ = (X̄ W̄^T)(W̄ W̄^T)^{-1}.
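As a concrete sketch (illustrative code, not the authors' implementation), the post-FGMM step for this example reduces to an IV-type estimator: regress the selected regressors on W to obtain Π̂, set D̂(W_i) = Π̂W_i, and solve (6.2); under homoskedasticity σ̂(W_i)² is constant and cancels.

```python
import numpy as np

def post_fgmm_linear(Y, X_S, W):
    """Post-FGMM estimator for Example 6.2: X_S = Pi W + u, homoskedastic errors.
    Y: (n,), X_S: (n, s) regressors kept in the first step, W: (n, d) instruments."""
    Pi_hat = np.linalg.solve(W.T @ W, W.T @ X_S).T     # (s, d): least-squares fit of X_S on W
    D_hat = W @ Pi_hat.T                               # rows are D_hat(W_i)^T
    # Solve rho_n(beta_S) = 0:  sum_i D_hat(W_i) (Y_i - X_iS^T beta_S) = 0.
    return np.linalg.solve(D_hat.T @ X_S, D_hat.T @ Y)
```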

Example 6.3 (Semi-nonparametric estimation)

We can also assume a semi-parametric structure on the functional forms of D(w) and σ(w)2:

D(w)=D(w;θ1),σ(w)2=σ2(w;θ2),

where D(·; θ_1) and σ²(·; θ_2) are semi-parametric functions parameterized by θ_1 and θ_2. Then D(w) and σ(w)² are estimated using a standard semi-parametric method. More generally, we can proceed by a purely nonparametric approach, regressing ∂g(Y, X_Ŝ^T β̂_S)/∂β_S and g(Y, X_Ŝ^T β̂_S)² on W respectively, provided that the dimension of W is either bounded or grows slowly with n (see Fan and Yao, 1998).

Theorem 6.1

Suppose s = O(1), Assumption 6.1 and those of Theorem 4.1 hold. Then

√n (β̂_S − β_{0S}) →_d N( 0, [E(σ(W)^{-2} D(W) D(W)^T)]^{-1} ),

and [E(σ(W)−2D(W)D(W)T)]–1 is the semi-parametric efficiency bound in Chamberlain (1987).

7. Implementation

We now discuss the implementation for numerically minimizing the penalized FGMM criterion function.

7.1. Smoothed FGMM

As we previously discussed, including an indicator function benefits us in dimension reduction. However, it also makes L_FGMM non-smooth. Hence, minimizing Q_FGMM(β) = L_FGMM(β) + penalty is generally NP-hard.

We overcome this discontinuity problem by applying the smoothing technique as in Horowitz (1992) and Bondell and Reich (2012), which approximates the indicator function by a smooth kernel K : (−∞,∞) → ℝ that satisfies

  1. 0 ≤ K(t) < M for some finite M and all t ≥ 0.

  2. K(0) = 0 and lim|t|→∞ K(t) = 1.

  3. lim sup|t|→∞ |K′(t)t| = 0, and lim sup|t|→∞ |K″(t)t2| < ∞.

We can set K(t) = (F(t) − F(0))/(1 − F(0)), where F(t) is a twice differentiable cumulative distribution function. For a pre-determined small number h_n, L_FGMM is approximated by a continuous function L_K(β) with the indicator replaced by K(β_j²/h_n). The objective function of the smoothed FGMM is given by

Q_K(β) = L_K(β) + Σ_{j=1}^p P_n(|β_j|).

As h_n → 0+, K(β_j²/h_n) converges to I(β_j ≠ 0), and hence L_K(β) is simply a smoothed version of L_FGMM(β). As an illustration, Figure 1 plots such a function.

Fig 1. K(t²/h_n) = (exp(t²/h_n) − 1)/(exp(t²/h_n) + 1) as an approximation to I(t ≠ 0).
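A minimal sketch of this smoothed indicator with the logistic CDF (the same choice used in Section 8) is given below; the function name is illustrative.

```python
import numpy as np

def smoothed_indicator(beta, h=0.1):
    """K(beta_j^2 / h) with K(u) = (F(u) - F(0)) / (1 - F(0)) and F the logistic CDF,
    i.e. K(u) = (exp(u) - 1)/(exp(u) + 1); as h -> 0+ this approaches I(beta_j != 0)."""
    u = np.asarray(beta, dtype=float) ** 2 / h
    return -np.expm1(-u) / (1.0 + np.exp(-u))   # stable evaluation of (e^u - 1)/(e^u + 1)

# smoothed_indicator([0.0, 0.05, 1.0])  ->  approximately [0.0, 0.012, 1.0]
```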

Smoothing the indicator function is often seen in the literature on high-dimensional variable selection. Recently, Bondell and Reich (2012) approximated I(t ≠ 0) by (h_n + 1)|t|/(h_n + |t|) to obtain a tractable non-convex optimization problem. Intuitively, we expect that the smoothed FGMM should also achieve the variable selection consistency. Indeed, the following theorem formally proves this claim.

Theorem 7.1

Suppose h_n^{1−γ} = o(d_n²) for a small constant γ ∈ (0, 1). Under the assumptions of Theorem 4.1, there exists a local minimizer β̂′ of the smoothed FGMM criterion Q_K(β) such that, for Ŝ′ = { j ≤ p : β̂′_j ≠ 0 },

P(Ŝ′ = S) → 1.

In addition, the local minimizer β̂′ is strict with probability at least 1 − δ for an arbitrarily small δ > 0 and all large n.

The asymptotic normality of the estimated nonzero coefficients can be established very similarly to that of Theorem 4.1, which is omitted for brevity.

7.2. Coordinate descent algorithm

We employ the iterative coordinate algorithm for the smoothed FGMM minimization, which was used by Fu (1998), Daubechies et al. (2004), Fan and Lv (2011), etc. The iterative coordinate algorithm minimizes one coordinate of β at a time, with other coordinates kept fixed at their values obtained from previous steps, and successively updates each coordinate. The penalty function can be approximated by local linear approximation as in Zou and Li (2008).

Specifically, we run the regular penalized least squares to obtain an initial value, from which we start the iterative coordinate algorithm for the smoothed FGMM. Suppose β^{(l)} is obtained at step l. For k ∈ {1, …, p}, denote by β^{(l)}_{(−k)} the (p − 1)-dimensional vector consisting of all the components of β^{(l)} except β_k^{(l)}. Write (β^{(l)}_{(−k)}, t) for the p-dimensional vector that replaces β_k^{(l)} with t. The minimization with respect to t while keeping β^{(l)}_{(−k)} fixed is then a univariate minimization problem, which is not difficult to implement. To speed up the convergence, we can also use the second order approximation of L_K(β^{(l)}_{(−k)}, t) along the kth component at β_k^{(l)}:

L_K(β^{(l)}_{(−k)}, t) ≈ L_K(β^{(l)}) + [∂L_K(β^{(l)})/∂β_k](t − β_k^{(l)}) + (1/2)[∂²L_K(β^{(l)})/∂β_k²](t − β_k^{(l)})² ≡ L_K(β^{(l)}) + L̂_K(β^{(l)}_{(−k)}, t), (7.1)

where L̂_K(β^{(l)}_{(−k)}, t) is a quadratic function of t. We solve for

t* = argmin_t L̂_K(β^{(l)}_{(−k)}, t) + P_n′(|β_k^{(l)}|) |t|, (7.2)

which admits an explicit analytical solution, and keep the remaining components at their step-l values. Accept t* as the updated kth component of β^{(l)} only if L_K(β^{(l)}) + Σ_{j=1}^p P_n(|β_j^{(l)}|) strictly decreases. The coordinate descent algorithm runs as follows.

  1. Set l = 1. Initialize β(1) = β̂*, where β̂* solves

    min_{β∈ℝ^p} n^{-1} Σ_{i=1}^n [g(Y_i, X_i^T β)]² + Σ_{j=1}^p P_n(|β_j|)

    using the coordinate descent algorithm as in Fan and Lv (2011).

  2. Successively for k = 1, …, p, let t* be the minimizer of

    min_t L̂_K(β^{(l)}_{(−k)}, t) + P_n′(|β_k^{(l)}|) |t|.

    Update β_k^{(l)} as t* if

    L_K(β^{(l)}_{(−k)}, t*) + P_n(|t*|) < L_K(β^{(l)}) + P_n(|β_k^{(l)}|).

    Otherwise set β_k^{(l)} = β_k^{(l−1)}. Increase l by one when k = p.

  3. Repeat Step 2 until | QK(β(l))−QK (β(l+1))| < ε, for a pre-determined small ε.

When the second order approximation (7.1) is combined with SCAD in Step 2, the local linear approximation of SCAD is not needed. As demonstrated in Fan and Li (2001), when P_n(t) is defined using SCAD, the penalized optimization of the form min_{β∈ℝ} (1/2)(z − β)² + Λ P_n(|β|) has an analytical solution.
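For completeness, here is the standard SCAD thresholding rule of Fan and Li (2001), written as a sketch for the case Λ = 1 (the general Λ case is handled analogously):

```python
import numpy as np

def scad_univariate(z, lam, a=3.7):
    """Minimizer of 0.5*(z - b)^2 + P_lambda(|b|) for the SCAD penalty (Lambda = 1)."""
    az = np.abs(z)
    if az <= 2 * lam:                                   # soft-thresholding region
        return np.sign(z) * max(az - lam, 0.0)
    if az <= a * lam:                                   # intermediate region
        return ((a - 1) * z - np.sign(z) * a * lam) / (a - 2)
    return z                                            # no shrinkage for large |z|
```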

We can show that the evaluated objective values {QK(β(l))}l≥1 is a bounded Cauchy sequence. Hence for an arbitrarily small ε > 0, the above algorithm stops after finitely many steps. Let M(β) denote the map defined by the algorithm from β(l) to β(l+1). We define a stationary point of the function QK(β) to be any point β at which the gradient vector of QK(β) is zero. Similar to the local linear approximation of Zou and Li (2008), we have the following result regarding the property of the algorithm.

Theorem 7.2

The sequence {Q_K(β^{(l)})}_{l≥1} is a bounded non-increasing Cauchy sequence. Hence for any arbitrarily small ε > 0, the coordinate descent algorithm stops after finitely many iterations. In addition, if Q_K(β) = Q_K(M(β)) only for stationary points of Q_K(·) and if β* is a limit point of the sequence {β^{(l)}}_{l≥1}, then β* is a stationary point of Q_K(β).

Theoretical analysis of non-convex regularization in the recent decade has focused on numerical procedures that can find local solutions (Hunter and Li 2005, Kim et al. 2008, Breheny and Huang 2011). Proving that the algorithm achieves a solution that possesses the desired oracle properties is technically difficult. Our simulation results demonstrate that the proposed algorithm indeed reaches the desired sparse estimator. Further investigation along the lines of Zhang and Zhang (2012) and Loh and Wainwright (2013) is needed to establish the statistical properties of the solutions to such non-convex optimization problems, which we leave as future research.

8. Monte Carlo Experiments

8.1. Endogeneity in both important and unimportant regressors

To test the performance of FGMM for variable selection, we simulate from a linear model:

Y = X^T β_0 + ε,  (β_01, …, β_05) = (5, 4, 7, 2, 1.5);  β_0j = 0 for 6 ≤ j ≤ p,

with p = 50 or 200. Regressors are classified as exogenous (independent of ε) or endogenous. For each component of X, we write X_j = X_j^e if X_j is endogenous and X_j = X_j^x if X_j is exogenous, where X_j^e and X_j^x are generated according to:

X_j^e = (F_j + H_j + 1)(3ε + 1),  X_j^x = F_j + H_j + u_j,

where {ε, u_1, …, u_p} are independent N(0, 1). Here F = (F_1, …, F_p)^T and H = (H_1, …, H_p)^T are the transformations (to be specified later) of a three-dimensional instrumental variable W = (W_1, W_2, W_3)^T ~ N_3(0, I_3). There are m endogenous variables (X_1, X_2, X_3, X_6, …, X_{2+m})^T, with m = 10 or 50. Hence three of the important regressors (X_1, X_2, X_3) are endogenous while two are exogenous (X_4, X_5).

We apply the Fourier basis as the working instruments:

F = √2 { sin(jπW_1) + sin(jπW_2) + sin(jπW_3) : j ≤ p },  H = √2 { cos(jπW_1) + cos(jπW_2) + cos(jπW_3) : j ≤ p }.

The data contain n = 100 i.i.d. copies of (Y, X, F, H). PLS and FGMM are carried out separately for comparison. In our simulation we use SCAD with pre-determined tuning parameters of λ as the penalty function. The logistic cumulative distribution function with h = 0.1 is used for smoothing:

F(t) = exp(t)/(1 + exp(t)),  K(β_j²/h) = 2F(β_j²/h) − 1.
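A sketch of these working instruments in Python is given below; the √2 scaling follows the display above, and the function name is illustrative.

```python
import numpy as np

def fourier_instruments(W, p):
    """F_j = sqrt(2) * sum_k sin(j*pi*W_k), H_j = sqrt(2) * sum_k cos(j*pi*W_k), j = 1..p.
    W: (n, 3) draws of the instrumental variable; returns F, H of shape (n, p)."""
    j = np.arange(1, p + 1)
    angles = np.pi * W[:, :, None] * j          # shape (n, 3, p)
    F = np.sqrt(2) * np.sin(angles).sum(axis=1)
    H = np.sqrt(2) * np.cos(angles).sum(axis=1)
    return F, H
```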

There are 100 replications per experiment. Four performance measures are used to compare the methods. The first measure is the mean standard error (MSES) of the important regressors, given by the average of ‖β̂_S − β_{0S}‖ over the 100 replications, where S = {1, …, 5}. The second measure is the average of the MSE of the unimportant regressors, denoted by MSEN. The third measure is the number of correctly selected nonzero coefficients, that is, the true positive (TP), and the fourth measure is the number of incorrectly selected coefficients, the false positive (FP). In addition, the standard error over the 100 replications of each measure is also reported. In each simulation, we initialize β^{(0)} = (0, …, 0)^T and run a penalized least squares (SCAD(λ)) with λ = 0.5 to obtain the initial value for the FGMM procedure. The results of the simulation are summarized in Table 2, which compares the performance measures of PLS and FGMM.
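Combining the pieces, one replication of the Section 8.1 design can be generated as follows (a sketch that reuses fourier_instruments from above; the index bookkeeping for the m endogenous regressors is zero-based):

```python
import numpy as np

def simulate_design_8_1(n=100, p=50, m=10, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n, 3))                          # W ~ N_3(0, I_3)
    F, H = fourier_instruments(W, p)
    eps = rng.normal(size=n)
    X = F + H + rng.normal(size=(n, p))                  # exogenous version X_j^x
    endog = [0, 1, 2] + list(range(5, m + 2))            # X_1, X_2, X_3, X_6, ..., X_{m+2}
    X[:, endog] = (F[:, endog] + H[:, endog] + 1) * (3 * eps[:, None] + 1)
    beta0 = np.zeros(p)
    beta0[:5] = [5, 4, 7, 2, 1.5]
    Y = X @ beta0 + eps
    return Y, X, F, H, beta0
```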

Table 2. Endogeneity in both important and unimportant regressors, n = 100.

PLS FGMM
λ = 1 λ = 3 λ = 4 λ = 0.08 λ = 0.1 λ = 0.3 post-FGMM
p = 50 m = 10
MSES 0.190 (0.102) 0.525 (0.283) 0.491 (0.328) 0.106 (0.051) 0.097 (0.043) 0.102 (0.037) 0.088 (0.026)
MSEN 0.171 (0.059) 0.240 (0.149) 0.183 (0.149) 0.090 (0.030) 0.085 (0.035) 0.048 (0.034)
TP 5 (0) 5 (0) 4.97 (0.171) 5 (0) 5 (0) 5 (0)
FP 27.69 (6.260) 14.63 (5.251) 10.37 (4.539) 3.76 (1.093) 3.5 (1.193) 1.63 (1.070)

p = 200 m = 50
MSES 0.831 (0.787) 0.966 (0.595) 1.107 (0.678) 0.111 (0.048) 0.104 (0.041) 0.231 (0.431) 0.092 (0.032)
MSEN 1.286 (1.333) 0.936 (0.799) 0.828 (0.656) 0.062 (0.018) 0.063 (0.021) 0.053 (0.075)
TP 5 (0) 4.9 (0.333) 4.73 (0.468) 5 (0) 5 (0) 4.94 (0.246)
FP 86.760 (27.41) 42.440 (15.08) 35.070 (13.84) 4.726 (1.358) 4.276 (1.251) 2.897 (2.093)

m is the number of endogenous regressors. MSES is the average of ‖β̂_S − β_{0S}‖ for the nonzero coefficients. MSEN is the average of ‖β̂_N − β_{0N}‖ for the zero coefficients. TP is the number of correctly selected variables, and FP is the number of incorrectly selected variables. The standard error of each measure is also reported.

PLS has non-negligible false positives (FP). The average FP decreases as the magnitude of the penalty parameter increases, but at the cost of a relatively large MSES for the estimated nonzero coefficients, and the FP rate remains large compared to that of FGMM. PLS also misses some important regressors for larger λ. It is worth noting that the larger MSES for PLS is due to the bias of the least squares estimation in the presence of endogeneity. In contrast, FGMM performs well both in selecting the important regressors and in correctly eliminating the unimportant ones. The average MSES of FGMM is significantly less than that of PLS since instrumental variable estimation is applied instead. In addition, after the regressors are selected by the FGMM, the post-FGMM step further reduces the mean squared error of the estimators.

8.2. Endogeneity only in unimportant regressors

Consider a similar linear model, but now only the unimportant regressors are endogenous and all the important regressors are exogenous, as designed in Section 2.2, so the true model is the usual one without endogeneity. In this case, we apply (F, H) = (X, X²) as the working instruments for FGMM with the SCAD(λ) penalty, and need only the data X and Y = (Y_1, …, Y_n). We still compare the FGMM procedure with PLS. The results are reported in Table 3.

Table 3. Endogeneity only in unimportant regressors, n = 200.

PLS FGMM
λ = 0.1 λ = 0.5 λ = 1 λ = 0.05 λ = 0.1 λ = 0.2
p = 50
MSES 0.133 (0.043) 0.629 (0.301) 1.417 (0.329) 0.261 (0.094) 0.184 (0.069) 0.194 (0.076)
MSEN 0.068 (0.016) 0.072 (0.016) 0.095 (0.019) 0.001 (0.010) 0 (0) 0.001 (0.009)
TP 5 (0) 4.82 (0.385) 3.63 (0.504) 5 (0) 5 (0) 5 (0)
FP 35.36 (3.045) 8.84 (3.334) 2.58 (1.557) 0.08 (0.337) 0 (0) 0.02 (0.141)

p = 300
MSES 0.159 (0.054) 0.650 (0.304) 1.430 (0.310) 0.274 (0.086) 0.187 (0.102) 0.193 (0.123)
MSEN 0.107 (0.019) 0.071 (0.023) 0.086 (0.027) 5 × 10^{-4} (0.006) 0 (0) 5 × 10^{-4} (0.005)
TP 5 (0) 4.82 (0.384) 3.62 (0.487) 5 (0) 5 (0) 4.99 (0.100)
FP 210.47 (11.38) 42.78 (11.773) 7.94 (5.635) 0.11 (0.37) 0 (0) 0.01 (0.10)

It is clearly seen that even though only the unimportant regressors are endogenous, PLS still fails to select the true model correctly. This illustrates the variable selection inconsistency of PLS even when the true model itself involves no endogeneity. In contrast, the penalized FGMM still performs relatively well.

8.3. Weak minimal signals

To study the effect on variable selection when the strength of the minimal signal is weak, we run another set of simulations with the same data-generating process as in Design 1, but change β_4 = −0.5 and β_5 = 0.1, keeping all the remaining parameters the same as before. The minimal nonzero signal becomes |β_5| = 0.1. Three of the important regressors are endogenous as in Design 1. Table 4 indicates that the minimal signal is so small that it is not easily distinguishable from the zero coefficients.

Table 4. FGMM for weak minimal signal β4 = −0.5, β5 = 0.1.

p = 50 m = 10 p = 200 m = 50
λ = 0.05 λ = 0.1 λ = 0.5 λ = 0.05 λ = 0.1 λ = 0.5
MSES 0.128 (0.020) 0.107 (0.000) 0.118 (0.056) 0.138 (0.061) 0.125 (0.074) 0.238 (0.154)
MSEN 0.155 (0.054) 0.097 (0.000) 0.021 (0.033) 0.134 (0.052) 0.108 (0.043) 0.084 (0.062)
TP 4.12 (0.327) 4 (0) 4 (0) 4.04 (0.281) 3.98 (0.141) 3.8 (0.402)
FP 4.93 (1.578) 5 (0) 2.08 (0.367) 4.72 (1.198) 4.3 (0.948) 1.95 (1.351)

9. Conclusion and Discussion

Endogeneity can arise easily in the high-dimensional regression due to a large pool of regressors, which causes the inconsistency of the penalized least-squares methods and possible false scientific discoveries. Based on the over-identification assumption and valid instrumental variables, we propose to penalize an FGMM loss function. It is shown that FGMM possesses the oracle property, and the estimator is also a nearly global minimizer.

We would like to point out that this paper focuses on correctly specified sparse models, and the achieved results are “pointwise” for the true model. An important issue is uniform inference when the sparse model may be locally misspecified. While the oracle property is of fundamental importance for high-dimensional methods in many scientific applications, it may not enable us to make valid inference about the coefficients uniformly across a large class of models (Leeb and Pötscher 2008, Belloni et al. 2013).⁵ Therefore, the “post-double-selection” method with imperfect model selection recently proposed by Belloni et al. (2013) is important for making uniform inference. Research along that line under high-dimensional endogeneity is important, and we leave it for future research.

Finally, as discussed in Bickel et al. (2009) and van de Geer (2008), high-dimensional regression problems can be thought of as an approximation to a nonparametric regression problem with a “dictionary” of functions or growing number of sieves. Then in the presence of endogenous regressors, model (3.1) is closely related to the non-parametric conditional moment restricted model considered by, e.g., Newey and Powell (2003), Ai and Chen (2003), and Chen and Pouzo (2008). While the penalization in the latter literature is similar to ours, it plays a different role and is introduced for different purposes. It will be interesting to find the underlying relationships between the two models.

Supplementary Material


Acknowledgments

We would like to thank the anonymous reviewers, Associate Editor and Editor for their helpful comments that helped to improve the paper.

This project was supported by the National Science Foundation grant DMS-1206464 and the National Institute of General Medical Sciences of the National Institutes of Health through Grant Numbers R01GM100474 and R01-GM072611. The bulk of the research was carried out while Yuan Liao was a postdoctoral fellow at Princeton University.

Appendix A: Proofs for Section 2

Throughout the Appendix, C will denote a generic positive constant that may be different in different uses. Let sgn(·) denote the sign function.

A.1. Proof of Theorem 2.1

Proof. When $\hat\beta$ is a local minimizer of $Q_n(\beta)$, by the Karush-Kuhn-Tucker (KKT) condition, for every $l \le p$,

$$\frac{\partial L_n(\hat\beta)}{\partial \beta_l} + v_l = 0,$$

where $v_l = P_n'(|\hat\beta_l|)\,\mathrm{sgn}(\hat\beta_l)$ if $\hat\beta_l \neq 0$, and $v_l \in [-P_n'(0^+), P_n'(0^+)]$ if $\hat\beta_l = 0$, where we denote $P_n'(0^+) = \lim_{t\to 0^+} P_n'(t)$. By the monotonicity of $P_n'(t)$, we have $|\partial L_n(\hat\beta)/\partial\beta_l| \le P_n'(0^+)$. By a Taylor expansion and the Cauchy-Schwarz inequality, there is $\tilde\beta$ on the segment joining $\hat\beta$ and $\beta_0$ such that, on the event $\{\hat\beta_N = 0\}$ (so that $\hat\beta_j - \beta_{0j} = 0$ for all $j \notin S$),

$$\Big|\frac{\partial L_n(\hat\beta)}{\partial\beta_l} - \frac{\partial L_n(\beta_0)}{\partial\beta_l}\Big| = \Big|\sum_{j=1}^p \frac{\partial^2 L_n(\tilde\beta)}{\partial\beta_l\partial\beta_j}(\hat\beta_j - \beta_{0j})\Big| = \Big|\sum_{j\in S} \frac{\partial^2 L_n(\tilde\beta)}{\partial\beta_l\partial\beta_j}(\hat\beta_j - \beta_{0j})\Big|.$$

The Cauchy-Schwarz inequality then implies that $\max_{l\le p} |\partial L_n(\hat\beta)/\partial\beta_l - \partial L_n(\beta_0)/\partial\beta_l|$ is bounded by

$$\max_{l,j\le p}\Big|\frac{\partial^2 L_n(\tilde\beta)}{\partial\beta_l\partial\beta_j}\Big| \,\|\hat\beta_S - \beta_{0S}\|_1 \le \max_{l,j\le p}\Big|\frac{\partial^2 L_n(\tilde\beta)}{\partial\beta_l\partial\beta_j}\Big| \,\sqrt{s}\,\|\hat\beta_S - \beta_{0S}\|.$$

By our assumption, $\sqrt{s}\,\|\hat\beta_S - \beta_{0S}\| = o_p(1)$. Because $P(\hat\beta_N = 0) \to 1$,

$$\max_{l\le p}\Big|\frac{\partial L_n(\hat\beta)}{\partial\beta_l} - \frac{\partial L_n(\beta_0)}{\partial\beta_l}\Big| \stackrel{p}{\longrightarrow} 0. \quad (A.1)$$

This yields $\partial L_n(\beta_0)/\partial\beta_l = o_p(1)$.

A.2. Proof of Theorem 2.2

Proof. Let $\{X_{il}\}_{i=1}^n$ be the i.i.d. data of $X_l$, where $X_l$ is an endogenous regressor. For the penalized LS, $L_n(\beta) = \frac{1}{n}\sum_{i=1}^n (Y_i - X_i^T\beta)^2$. Under the theorem's assumptions, by the strong law of large numbers,

$$\frac{\partial}{\partial\beta_l} L_n(\beta_0) = -\frac{2}{n}\sum_{i=1}^n X_{il}(Y_i - X_i^T\beta_0) \longrightarrow -2E(X_l\varepsilon) \quad \text{almost surely},$$

which does not satisfy (2.1) of Theorem 2.1.
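To make this mechanism concrete, the following minimal sketch (an illustration we add here, not part of the paper's simulations; all numerical choices are hypothetical) simulates a single endogenous regressor and shows numerically that the least-squares score at the true coefficient vector does not vanish as the sample size grows, in line with the display above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 5
beta0 = np.array([1.0, -1.0, 0.0, 0.0, 0.0])     # sparse true coefficients

# Draw regressors; make the third regressor (an unimportant one) endogenous
# by correlating it with the regression error eps.
eps = rng.normal(size=n)
X = rng.normal(size=(n, p))
X[:, 2] = 0.8 * eps + 0.6 * rng.normal(size=n)   # Cov(X_3, eps) = 0.8 != 0

Y = X @ beta0 + eps

# Least-squares score at beta0:  dL_n/dbeta_l = -(2/n) * sum_i X_il (Y_i - X_i' beta0)
score = -2.0 * X.T @ (Y - X @ beta0) / n
print(score)   # third entry stays near -2*E(X_3 eps) = -1.6, the others near 0
```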

Appendix B: General Penalized Regressions

We present some general results for the oracle properties of penalized regressions. These results will be employed to prove the oracle properties for the proposed FGMM. Consider a penalized regression of the form:

$$\min_{\beta\in\mathbb{R}^p}\ L_n(\beta) + \sum_{j=1}^p P_n(|\beta_j|).$$

Lemma B.1

Under Assumption 4.1, if $\beta = (\beta_1,\ldots,\beta_s)^T$ is such that $\max_{j\le s}|\beta_j - \beta_{0S,j}| \le d_n$, then

$$\Big|\sum_{j=1}^s P_n(|\beta_j|) - P_n(|\beta_{0S,j}|)\Big| \le \|\beta - \beta_{0S}\|\,\sqrt{s}\,P_n'(d_n).$$

Proof. By a Taylor expansion, there exists $\beta^*$ (with $\beta_j^* \neq 0$ for each $j$) lying on the line segment joining $\beta$ and $\beta_{0S}$ such that

$$\sum_{j=1}^s\big(P_n(|\beta_j|) - P_n(|\beta_{0S,j}|)\big) = \big(P_n'(|\beta_1^*|)\mathrm{sgn}(\beta_1^*),\ldots,P_n'(|\beta_s^*|)\mathrm{sgn}(\beta_s^*)\big)^T(\beta - \beta_{0S}) \le \|\beta - \beta_{0S}\|\,\sqrt{s}\,\max_{j\le s}P_n'(|\beta_j^*|).$$

Then $\min\{|\beta_j^*| : j\le s\} \ge \min\{|\beta_{0S,j}| : j\le s\} - \max_{j\le s}|\beta_j^* - \beta_{0S,j}| \ge 2d_n - d_n = d_n$.

Since $P_n'$ is non-increasing (as $P_n$ is concave), $P_n'(|\beta_j^*|) \le P_n'(d_n)$ for all $j\le s$. Therefore $\sum_{j=1}^s\big(P_n(|\beta_j|) - P_n(|\beta_{0S,j}|)\big) \le \|\beta - \beta_{0S}\|\,\sqrt{s}\,P_n'(d_n)$.

In the theorems below, with $S = \{j : \beta_{0j}\ne 0\}$, define the so-called "oracle space" $\{\beta\in\mathbb{R}^p : \beta_j = 0 \text{ if } j\notin S\}$. Write $L_n(\beta_S, 0) = L_n(\beta)$ for $\beta = (\beta_S^T, 0)^T$. Let $\beta_S = (\beta_{S1},\ldots,\beta_{Ss})$ and

$$\nabla_S L_n(\beta_S, 0) = \Big(\frac{\partial L_n(\beta_S,0)}{\partial\beta_{S1}},\ldots,\frac{\partial L_n(\beta_S,0)}{\partial\beta_{Ss}}\Big)^T.$$

Theorem B.1 (Oracle Consistency)

Suppose Assumption 4.1 holds. In addition, suppose $L_n(\beta_S, 0)$ is twice differentiable with respect to $\beta_S$ in a neighborhood of $\beta_{0S}$ restricted on the oracle space, and there exists a positive sequence $a_n = o(d_n)$ such that:

  1. $\|\nabla_S L_n(\beta_{0S}, 0)\| = O_p(a_n)$.
  2. For any $\varepsilon > 0$, there is $C_\varepsilon > 0$ so that for all large $n$,

    $$P\big(\lambda_{\min}\big(\nabla_S^2 L_n(\beta_{0S},0)\big) > C_\varepsilon\big) > 1 - \varepsilon. \quad (B.1)$$
  3. For any $\varepsilon > 0$, $\delta > 0$, and any nonnegative sequence $\alpha_n = o(d_n)$, there is $N > 0$ such that when $n > N$,

    $$P\Big(\sup_{\|\beta_S - \beta_{0S}\|\le\alpha_n}\big\|\nabla_S^2 L_n(\beta_S,0) - \nabla_S^2 L_n(\beta_{0S},0)\big\|_F \le \delta\Big) > 1 - \varepsilon. \quad (B.2)$$

Then there exists a local minimizer $\hat\beta = (\hat\beta_S^T, 0)^T$ of

$$Q_n(\beta_S, 0) = L_n(\beta_S, 0) + \sum_{j\in S}P_n(|\beta_j|)$$

such that $\|\hat\beta_S - \beta_{0S}\| = O_p(a_n + \sqrt{s}\,P_n'(d_n))$. In addition, for an arbitrarily small $\varepsilon > 0$, the local minimizer $\hat\beta$ is strict with probability at least $1 - \varepsilon$, for all large $n$.

Proof. The proof is a generalization of the proof of Theorem 3 in Fan and Lv (2011). Let $k_n = a_n + \sqrt{s}\,P_n'(d_n)$. It is our assumption that $k_n = o(1)$. Write $Q_1(\beta_S) = Q_n(\beta_S, 0)$ and $L_1(\beta_S) = L_n(\beta_S, 0)$. In addition, write

$$\nabla L_1(\beta_S) = \frac{\partial L_n}{\partial\beta_S}(\beta_S, 0), \qquad \nabla^2 L_1(\beta_S) = \frac{\partial^2 L_n}{\partial\beta_S\partial\beta_S^T}(\beta_S, 0).$$

Define $\mathcal{N}_\tau = \{\beta\in\mathbb{R}^s : \|\beta - \beta_{0S}\| \le k_n\tau\}$ for some $\tau > 0$, and let $\partial\mathcal{N}_\tau$ denote the boundary of $\mathcal{N}_\tau$. Now define the event

$$H_n(\tau) = \Big\{Q_1(\beta_{0S}) < \min_{\beta_S\in\partial\mathcal{N}_\tau}Q_1(\beta_S)\Big\}.$$

On the event $H_n(\tau)$, by the continuity of $Q_1$, there exists a local minimizer of $Q_1$ inside $\mathcal{N}_\tau$. Equivalently, there exists a local minimizer $(\hat\beta_S^T, 0)^T$ of $Q_n$ restricted on the oracle space $\{\beta = (\beta_S^T, 0)^T\}$ with $\beta_S$ inside $\mathcal{N}_\tau$. Therefore, it suffices to show that for every $\varepsilon > 0$ there exists $\tau > 0$ so that $P(H_n(\tau)) > 1 - \varepsilon$ for all large $n$, and that the local minimizer is strict with probability arbitrarily close to one.

For any $\beta_S\in\partial\mathcal{N}_\tau$, that is, $\|\beta_S - \beta_{0S}\| = k_n\tau$, there is $\beta^*$ lying on the segment joining $\beta_S$ and $\beta_{0S}$ such that, by a Taylor expansion of $L_1(\beta_S)$,

$$Q_1(\beta_S) - Q_1(\beta_{0S}) = (\beta_S - \beta_{0S})^T\nabla L_1(\beta_{0S}) + \frac12(\beta_S - \beta_{0S})^T\nabla^2 L_1(\beta^*)(\beta_S - \beta_{0S}) + \sum_{j=1}^s\big[P_n(|\beta_{Sj}|) - P_n(|\beta_{0S,j}|)\big].$$

By Condition (i), $\|\nabla L_1(\beta_{0S})\| = O_p(a_n)$, so for any $\varepsilon > 0$ there exists $C_1 > 0$ such that the event $H_1$ satisfies $P(H_1) > 1 - \varepsilon/4$ for all large $n$, where

$$H_1 = \big\{(\beta_S - \beta_{0S})^T\nabla L_1(\beta_{0S}) \ge -C_1\|\beta_S - \beta_{0S}\|\,a_n\big\}. \quad (B.3)$$

In addition, Condition (ii) yields that there exists $C_\varepsilon > 0$ such that the following event $H_2$ satisfies $P(H_2) \ge 1 - \varepsilon/4$ for all large $n$, where

$$H_2 = \big\{(\beta_S - \beta_{0S})^T\nabla^2 L_1(\beta_{0S})(\beta_S - \beta_{0S}) > C_\varepsilon\|\beta_S - \beta_{0S}\|^2\big\}. \quad (B.4)$$

Define another event $H_3 = \{\|\nabla^2 L_1(\beta_{0S}) - \nabla^2 L_1(\beta^*)\|_F < C_\varepsilon/4\}$. Since $\|\beta_S - \beta_{0S}\| = k_n\tau$, by Condition (B.2), for any $\tau > 0$, $P(H_3) > 1 - \varepsilon/4$ for all large $n$. On the event $H_2\cap H_3$, the following event $H_4$ holds:

$$H_4 = \Big\{(\beta_S - \beta_{0S})^T\nabla^2 L_1(\beta^*)(\beta_S - \beta_{0S}) > \frac{3C_\varepsilon}{4}\|\beta_S - \beta_{0S}\|^2\Big\}.$$

By Lemma B.1, $\big|\sum_{j=1}^s[P_n(|\beta_{Sj}|) - P_n(|\beta_{0S,j}|)]\big| \le \sqrt{s}\,P_n'(d_n)\|\beta_S - \beta_{0S}\|$. Hence for any $\beta_S\in\partial\mathcal{N}_\tau$, on $H_1\cap H_4$,

$$Q_1(\beta_S) - Q_1(\beta_{0S}) \ge k_n\tau\Big(\frac{3k_n\tau C_\varepsilon}{8} - C_1a_n - \sqrt{s}\,P_n'(d_n)\Big).$$

For $k_n = a_n + \sqrt{s}\,P_n'(d_n)$, we have $C_1a_n + \sqrt{s}\,P_n'(d_n) \le (C_1 + 1)k_n$. Therefore, we can choose $\tau > 8(C_1 + 1)/(3C_\varepsilon)$ so that $Q_1(\beta_S) - Q_1(\beta_{0S}) \ge 0$ uniformly over $\beta_S\in\partial\mathcal{N}_\tau$. Thus for all large $n$, when $\tau > 8(C_1 + 1)/(3C_\varepsilon)$,

$$P(H_n(\tau)) \ge P(H_1\cap H_4) \ge 1 - \varepsilon.$$

It remains to show that the local minimizer in $\mathcal{N}_\tau$ (denoted by $\hat\beta_S$) is strict with probability arbitrarily close to one. For each $h\in\mathbb{R}\setminus\{0\}$, define

$$\psi(h) = \limsup_{\varepsilon\to 0^+}\ \sup_{\substack{t_1 < t_2 \\ (t_1,t_2)\subset(|h|-\varepsilon,\,|h|+\varepsilon)}} -\frac{P_n'(t_2) - P_n'(t_1)}{t_2 - t_1}.$$

By the concavity of $P_n(\cdot)$, $\psi(\cdot) \ge 0$. We know that $L_1$ is twice differentiable on $\mathbb{R}^s$. For $\beta_S\in\mathcal{N}_\tau$, let $A(\beta_S) = \nabla^2 L_1(\beta_S) - \mathrm{diag}\{\psi(\beta_{S1}),\ldots,\psi(\beta_{Ss})\}$. It suffices to show that $A(\hat\beta_S)$ is positive definite with probability arbitrarily close to one. On the event $H_5 = \{\eta(\hat\beta_S) \le \sup_{\beta\in B(\beta_{0S},\,cd_n)}\eta(\beta)\}$ (where $cd_n$ is as defined in Assumption 4.1(iv)),

$$\max_{j\le s}\psi(\hat\beta_{S,j}) \le \eta(\hat\beta_S) \le \sup_{\beta\in B(\beta_{0S},\,cd_n)}\eta(\beta).$$

Also define events $H_6 = \{\|\nabla^2 L_1(\hat\beta_S) - \nabla^2 L_1(\beta_{0S})\|_F < C_\varepsilon/4\}$ and $H_7 = \{\lambda_{\min}(\nabla^2 L_1(\beta_{0S})) > C_\varepsilon\}$. Then on $H_5\cap H_6\cap H_7$, for any $\alpha\in\mathbb{R}^s$ satisfying $\|\alpha\| = 1$, by Assumption 4.1(iv),

$$\alpha^TA(\hat\beta_S)\alpha \ge \alpha^T\nabla^2 L_1(\beta_{0S})\alpha - \big|\alpha^T\big(\nabla^2 L_1(\hat\beta_S) - \nabla^2 L_1(\beta_{0S})\big)\alpha\big| - \max_{j\le s}\psi(\hat\beta_{S,j}) \ge \frac{3C_\varepsilon}{4} - \sup_{\beta\in B(\beta_{0S},\,cd_n)}\eta(\beta) \ge \frac{C_\varepsilon}{4}$$

for all large $n$. This then implies $\lambda_{\min}(A(\hat\beta_S)) \ge C_\varepsilon/4$ for all large $n$.

We know that $P(\lambda_{\min}[\nabla^2 L_1(\beta_{0S})] > C_\varepsilon) > 1 - \varepsilon$. It remains to show that $P(H_5\cap H_6) > 1 - \varepsilon$ for arbitrarily small $\varepsilon$. Because $k_n = o(d_n)$, for an arbitrarily small $\varepsilon > 0$, $P(H_5) \ge P(\hat\beta_S\in B(\beta_{0S}, cd_n)) \ge 1 - \varepsilon/2$ for all large $n$. Finally,

$$P(H_6^c) \le P\big(H_6^c,\ \|\hat\beta_S - \beta_{0S}\| \le k_n\big) + P\big(\|\hat\beta_S - \beta_{0S}\| > k_n\big) \le P\Big(\sup_{\|\beta_S - \beta_{0S}\|\le k_n}\|\nabla^2 L_1(\beta_S) - \nabla^2 L_1(\beta_{0S})\|_F \ge C_\varepsilon/4\Big) + \varepsilon/4 \le \varepsilon/2.$$
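To make the local concavity measure $\psi(\cdot)$ concrete, here is a short worked example that we add for illustration, assuming the SCAD penalty of Fan and Li (2001) with tuning parameters $\lambda_n$ and $a > 2$ (the argument above does not require this particular choice):

$$P_n'(t) = \lambda_n\Big\{I(t\le\lambda_n) + \frac{(a\lambda_n - t)_+}{(a-1)\lambda_n}I(t > \lambda_n)\Big\}, \qquad t > 0.$$

On $(0,\lambda_n)$ and $(a\lambda_n,\infty)$ the derivative is locally constant, so $\psi(h) = 0$ there; on $[\lambda_n, a\lambda_n]$ it decreases linearly with slope $-1/(a-1)$, so

$$\psi(h) = \frac{1}{a-1}\,I\big(\lambda_n \le |h| \le a\lambda_n\big).$$

In particular, $\max_{j\le s}\psi(\beta_j) = 0$ whenever all nonzero components of $\beta$ exceed $a\lambda_n$ in absolute value, which is consistent with the role played by $\sup_\beta\eta(\beta)$ in the strictness argument above.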

The previous theorem assumes that the true support $S$ is known, which is not practical. We therefore need to derive conditions under which $S$ can be recovered from the data with probability approaching one. This can be done by demonstrating that the local minimizer of $Q_n$ restricted on the oracle space is also a local minimizer on $\mathbb{R}^p$. The following theorem establishes the variable selection consistency of the estimator, defined as a local solution to a penalized regression problem on $\mathbb{R}^p$.

For any $\beta\in\mathbb{R}^p$, define the projection

$$(T\beta)_j = \begin{cases}\beta_j, & j\in S,\\ 0, & j\notin S,\end{cases}\qquad j = 1,\ldots,p. \quad (B.5)$$

Theorem B.2(Variable selection)

Suppose Ln : ℝp → ℝ satisfies the conditions in Theorem B.1, and Assumption 4.1 holds. Assume the following Condition A holds:

Condition A: With probability approaching one, for $\hat\beta_S$ in Theorem B.1, there exists a neighborhood $\mathcal{B}\subset\mathbb{R}^p$ of $(\hat\beta_S^T, 0)^T$ such that for all $\beta = (\beta_S^T,\beta_N^T)^T\in\mathcal{B}$ with $\beta_N\ne 0$,

$$L_n(T\beta) - L_n(\beta) < \sum_{j\notin S}P_n(|\beta_j|). \quad (B.6)$$

Then (i) with probability approaching one, $\hat\beta = (\hat\beta_S^T, 0)^T$ is a local minimizer in $\mathbb{R}^p$ of

$$Q_n(\beta) = L_n(\beta) + \sum_{i=1}^p P_n(|\beta_i|);$$

(ii) for an arbitrarily small $\varepsilon > 0$, the local minimizer $\hat\beta$ is strict with probability at least $1 - \varepsilon$, for all large $n$.

Proof. Let $\hat\beta = (\hat\beta_S^T, 0)^T$ with $\hat\beta_S$ being the local minimizer of $Q_1(\beta_S)$ as in Theorem B.1. We now show that, with probability approaching one, there is a random neighborhood $\mathcal{B}$ of $\hat\beta$ such that for every $\beta = (\beta_S^T, \beta_N^T)^T\in\mathcal{B}$ with $\beta_N\ne 0$, we have $Q_n(\hat\beta) < Q_n(\beta)$, where the inequality is strict.

To show this, first note that we can take $\mathcal{B}$ sufficiently small so that $Q_1(\hat\beta_S) \le Q_1(\beta_S)$, because $\hat\beta_S$ is a local minimizer of $Q_1(\beta_S)$ from Theorem B.1. Recall the projection $T\beta = (\beta_S^T, 0)^T$ and that $Q_n(T\beta) = Q_1(\beta_S)$ by the definition of $Q_1$. We have $Q_n(\hat\beta) = Q_1(\hat\beta_S) \le Q_1(\beta_S) = Q_n(T\beta)$. Therefore, it suffices to show that, with probability approaching one, there is a sufficiently small neighborhood $\mathcal{B}$ of $\hat\beta$ such that for any $\beta = (\beta_S^T, \beta_N^T)^T\in\mathcal{B}$ with $\beta_N\ne 0$, $Q_n(T\beta) < Q_n(\beta)$.

In fact, this is implied by Condition (B.6):

$$Q_n(T\beta) - Q_n(\beta) = L_n(T\beta) - L_n(\beta) - \Big(\sum_{j=1}^p P_n(|\beta_j|) - \sum_{j\in S}P_n(|(T\beta)_j|)\Big) < 0. \quad (B.7)$$

The above inequality, together with the last statement of Theorem B.1, implies part (ii) of the theorem.

Appendix C: Proofs for Section 4

Throughout the proof, we write $F_{iS} = F_i(\beta_{0S})$, $H_{iS} = H_i(\beta_{0S})$ and $V_{iS} = (F_{iS}^T, H_{iS}^T)^T$.

Lemma C.1

  1. $\max_{j\le p}\big|\frac1n\sum_{i=1}^n(F_{ij} - \bar F_j)^2 - \mathrm{var}(F_j)\big| = o_p(1)$.

  2. $\max_{j\le p}\big|\frac1n\sum_{i=1}^n(H_{ij} - \bar H_j)^2 - \mathrm{var}(H_j)\big| = o_p(1)$.

  3. $\sup_{\beta\in\mathbb{R}^p}\lambda_{\max}(J(\beta)) = O_p(1)$, and $\lambda_{\min}(J(\beta_0))$ is bounded away from zero with probability approaching one.

Proof. Parts (i) and (ii) follow from an application of standard large deviation theory, using the Bernstein inequality and Bonferroni's method. Part (iii) follows from the assumption that $\mathrm{var}(F_j)$ and $\mathrm{var}(H_j)$ are bounded uniformly in $j\le p$.

C.1. Verifying conditions in Theorems B.1, B.2

C.1.1. Verifying conditions in Theorem B.1

For any $\beta\in\mathbb{R}^p$, we can write $T\beta = (\beta_S^T, 0)^T$. With slight abuse of notation, define

$$L_{FGMM}(\beta_S) = \Big[\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_S)V_{iS}\Big]^TJ(\beta_0)\Big[\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_S)V_{iS}\Big].$$

Then $L_{FGMM}(\beta_S) = L_{FGMM}(\beta_S, 0)$.
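To fix ideas, here is a minimal numerical sketch of the restricted criterion displayed above. It is our illustration rather than the authors' code: it assumes the linear model with $g(y,t) = y - t$ and takes the working instruments $F_j = X_j$ and $H_j = X_j^2$, with the diagonal weights $1/\widehat{\mathrm{var}}(F_j)$ and $1/\widehat{\mathrm{var}}(H_j)$ standing in for $J(\beta_0)$; other choices of $g$, $F$, $H$ and weights are covered by the general theory.

```python
import numpy as np

def fgmm_loss_restricted(beta_S, X_S, Y):
    """Restricted FGMM criterion L_FGMM(beta_S) for a linear model.

    Illustration only: g(y, t) = y - t, instruments F_j = X_j and H_j = X_j^2
    over the candidate support, and diagonal weights 1/var(F_j), 1/var(H_j)
    playing the role of J(beta_0).
    """
    n = X_S.shape[0]
    g = Y - X_S @ beta_S                     # residuals g(Y_i, X_iS' beta_S)
    V = np.hstack([X_S, X_S ** 2])           # instruments V_iS = (F_iS', H_iS')'
    moments = V.T @ g / n                    # (1/n) sum_i g_i * V_iS
    weights = 1.0 / V.var(axis=0)            # diagonal entries of the weight matrix
    return float(moments @ (weights * moments))

# toy usage with a hypothetical candidate support of size 2
rng = np.random.default_rng(1)
n = 500
X_S = rng.normal(size=(n, 2))
beta_true = np.array([2.0, -1.0])
Y = X_S @ beta_true + rng.normal(size=n)

print(fgmm_loss_restricted(beta_true, X_S, Y))             # small: sample moments near zero
print(fgmm_loss_restricted(np.array([0.0, 0.0]), X_S, Y))  # much larger at a wrong value
```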

Condition (i)

$\nabla L_{FGMM}(\beta_{0S}) = 2A_n(\beta_{0S})J(\beta_0)\big[\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_{0S})V_{iS}\big]$, where

$$A_n(\beta_S) \equiv \frac1n\sum_{i=1}^n m(Y_i, X_{iS}^T\beta_S)X_{iS}V_{iS}^T. \quad (C.1)$$

By Assumption 4.5, $\|A_n(\beta_{0S})\| = O_p(1)$. In addition, the elements of $J(\beta_0)$ are uniformly bounded in probability due to Lemma C.1. Hence $\|\nabla L_{FGMM}(\beta_{0S})\| \le O_p(1)\big\|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_{0S})V_{iS}\big\|$. Since $Eg(Y, X_S^T\beta_{0S})V_S = 0$, using the exponential-tail Bernstein inequality under Assumption 4.3 together with the Bonferroni inequality, it can be shown that there is $C > 0$ such that for any $t > 0$,

$$P\Big(\max_{l\le p}\Big|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_{0S})F_{li}\Big| > t\Big) \le \sum_{l\le p}P\Big(\Big|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_{0S})F_{li}\Big| > t\Big) \le \exp\big(\log p - Cnt^2\big),$$

which implies $\max_{l\le p}\big|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_{0S})F_{li}\big| = O_p(\sqrt{\log p/n})$. Similarly, $\max_{l\le p}\big|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_{0S})H_{li}\big| = O_p(\sqrt{\log p/n})$. Hence $\|\nabla L_{FGMM}(\beta_{0S})\| = O_p(\sqrt{(s\log p)/n})$.

Condition (ii)

Straightforward but tedious calculation yields

$$\nabla^2 L_{FGMM}(\beta_{0S}) = \Sigma(\beta_{0S}) + M(\beta_{0S}),$$

where $\Sigma(\beta_{0S}) = 2A_n(\beta_{0S})J(\beta_0)A_n(\beta_{0S})^T$ and $M(\beta_{0S}) = 2Z(\beta_{0S})B(\beta_{0S})$, with (writing $X_{iS} = (X_{il_1},\ldots,X_{il_s})^T$)

$$Z(\beta_{0S}) = \frac1n\sum_{i=1}^n q(Y_i, X_{iS}^T\beta_{0S})\big(X_{il_1}X_{iS},\ldots,X_{il_s}X_{iS}\big)V_{iS}^T, \qquad B(\beta_{0S}) = J(\beta_0)\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_{0S})V_{iS}.$$

It is not hard to obtain $\|B(\beta_{0S})\|_F = O_p(\sqrt{s\log p/n})$ and $\|Z(\beta_{0S})\|_F = O_p(s)$, and hence $\|M(\beta_{0S})\|_F = O_p(s\sqrt{s\log p/n}) = o_p(1)$.

Moreover, there is a constant $C > 0$ such that $P(\min_{j\in S}\widehat{\mathrm{var}}(X_j)^{-1} > C) > 1 - \varepsilon$ and $P(\min_{j\le p}\widehat{\mathrm{var}}(X_j^2)^{-1} > C) > 1 - \varepsilon$ for all large $n$ and any $\varepsilon > 0$. This then implies $P(\lambda_{\min}[J(\beta_0)] > C) > 1 - \varepsilon$. Recall Assumption 4.5, which states that $\lambda_{\min}(EA_n(\beta_{0S})EA_n(\beta_{0S})^T) > C_2$ for some $C_2 > 0$. Define the events

$$G_1 = \{\lambda_{\min}[J(\beta_0)] > C\},\quad G_2 = \{\|M(\beta_{0S})\|_F < C_2C/5\},\quad G_3 = \{\|A_n(\beta_{0S})A_n(\beta_{0S})^T - EA_n(\beta_{0S})EA_n(\beta_{0S})^T\| < C_2/5\}.$$

Then on the event $\cap_{i=1}^3 G_i$,

$$\lambda_{\min}\big[\nabla^2 L_{FGMM}(\beta_{0S})\big] \ge 2\lambda_{\min}(J(\beta_0))\lambda_{\min}\big(A_n(\beta_{0S})A_n(\beta_{0S})^T\big) - \|M(\beta_{0S})\|_F \ge 2C\big[\lambda_{\min}(EA_n(\beta_{0S})EA_n(\beta_{0S})^T) - C_2/5\big] - C_2C/5 \ge 7CC_2/5.$$

Note that $P(\cap_{i=1}^3 G_i) \ge 1 - \sum_{i=1}^3 P(G_i^c) \ge 1 - 3\varepsilon$. Hence Condition (B.1) is satisfied.

Condition (iii)

It can be shown that for any nonnegative sequence $\alpha_n = o(d_n)$, where $d_n = \min_{k\in S}|\beta_{0k}|/2$, we have

$$P\Big(\sup_{\|\beta_S - \beta_{0S}\|\le\alpha_n}\|M(\beta_S) - M(\beta_{0S})\|_F \le \delta\Big) > 1 - \varepsilon \quad (C.2)$$

for any $\varepsilon, \delta > 0$. As for $\Sigma(\beta_S)$, note that for all $\beta_S$ such that $\|\beta_S - \beta_{0S}\| < d_n/2$, we have $\beta_{S,k}\ne 0$ for all $k\le s$. Thus $J(\beta_S) = J(\beta_{0S})$. Then $P(\sup_{\|\beta_S - \beta_{0S}\|<\alpha_n}\|\Sigma(\beta_S) - \Sigma(\beta_{0S})\|_F \le \delta) > 1 - \varepsilon$ holds since $P(\sup_{\|\beta_S - \beta_{0S}\|<\alpha_n}\|A_n(\beta_S) - A_n(\beta_{0S})\|_F \le \delta) > 1 - \varepsilon$.

C.1.2. Verifying conditions in Theorem B.2

Proof. We verify Condition A of Theorem B.2; that is, with probability approaching one, there is a random neighborhood $\mathcal{B}$ of $\hat\beta = (\hat\beta_S^T, 0)^T$ such that for any $\beta = (\beta_S^T, \beta_N^T)^T\in\mathcal{B}$ with $\beta_N\ne 0$, condition (B.6) holds.

Let $F(T\beta) = \{F_l : l\in S, \beta_l\ne 0\}$ and $H(T\beta) = \{H_l : l\in S, \beta_l\ne 0\}$ for any fixed $\beta = (\beta_S^T, \beta_N^T)^T$. Define

$$\Xi(\beta) = \Big[\frac1n\sum_{i=1}^n g(Y_i, X_i^T\beta)F_i(T\beta)\Big]^TJ_1(T\beta)\Big[\frac1n\sum_{i=1}^n g(Y_i, X_i^T\beta)F_i(T\beta)\Big] + \Big[\frac1n\sum_{i=1}^n g(Y_i, X_i^T\beta)H_i(T\beta)\Big]^TJ_2(T\beta)\Big[\frac1n\sum_{i=1}^n g(Y_i, X_i^T\beta)H_i(T\beta)\Big],$$

where $J_1(T\beta)$ and $J_2(T\beta)$ are the upper-$|S|_0$ and lower-$|S|_0$ submatrices of $J(T\beta)$. Hence $L_{FGMM}(T\beta) = \Xi(T\beta)$. Then $L_{FGMM}(\beta) - \Xi(\beta)$ equals

$$\sum_{l\notin S,\,\beta_l\ne 0}\Big[w_{l1}\Big(\frac1n\sum_{i=1}^n g(Y_i, X_i^T\beta)F_{il}\Big)^2 + w_{l2}\Big(\frac1n\sum_{i=1}^n g(Y_i, X_i^T\beta)H_{il}\Big)^2\Big],$$

where $w_{l1} = 1/\widehat{\mathrm{var}}(F_l)$ and $w_{l2} = 1/\widehat{\mathrm{var}}(H_l)$. So $L_{FGMM}(\beta) \ge \Xi(\beta)$, which implies $L_{FGMM}(T\beta) - L_{FGMM}(\beta) \le \Xi(T\beta) - \Xi(\beta)$. By the mean value theorem, there exists $\lambda\in(0,1)$ such that, for $h = (\beta_S^T, \lambda\beta_N^T)^T$,

$$\Xi(T\beta) - \Xi(\beta) = \sum_{l\notin S,\,\beta_l\ne 0}\beta_l\Big[\frac1n\sum_{i=1}^n X_{il}m(Y_i, X_i^Th)F_i(T\beta)\Big]^TJ_1(T\beta)\Big[\frac1n\sum_{i=1}^n g(Y_i, X_i^Th)F_i(T\beta)\Big] + \sum_{l\notin S,\,\beta_l\ne 0}\beta_l\Big[\frac1n\sum_{i=1}^n X_{il}m(Y_i, X_i^Th)H_i(T\beta)\Big]^TJ_2(T\beta)\Big[\frac1n\sum_{i=1}^n g(Y_i, X_i^Th)H_i(T\beta)\Big] \equiv \sum_{l\notin S,\,\beta_l\ne 0}\beta_l\big(a_l(\beta) + b_l(\beta)\big).$$

Let $\mathcal{B}$ be a neighborhood of $\hat\beta = (\hat\beta_S^T, 0)^T$ (to be determined later). We have shown that $\Xi(T\beta) - \Xi(\beta) = \sum_{l\notin S,\beta_l\ne 0}\beta_l(a_l(\beta) + b_l(\beta))$ for any $\beta$, where

$$a_l(\beta) = \Big[\frac1n\sum_{i=1}^n X_{il}m(Y_i, X_i^Th)F_i(T\beta)\Big]^TJ_1(T\beta)\Big[\frac1n\sum_{i=1}^n g(Y_i, X_i^Th)F_i(T\beta)\Big],$$

and $b_l(\beta)$ is defined similarly based on $H$. Note that $h$ lies on the segment joining $\beta$ and $T\beta$ and is determined by $\beta$; hence it should be understood as a function of $\beta$. By our assumption, there is a constant $M$ such that $|m(t_1,t_2)|$ and $|q(t_1,t_2)|$, the first and second partial derivatives of $g$, and $EX_l^2F_k^2$ are all bounded by $M$ uniformly in $t_1, t_2$ and $l, k\le p$. Therefore the Cauchy-Schwarz and triangle inequalities imply

$$\Big\|\frac1n\sum_{i=1}^n X_{il}m(Y_i, X_i^Th)F_i(T\beta)\Big\|^2 \le M^2\max_{l\le p}\Big|\frac1n\sum_{i=1}^n\|X_{il}F_{iS}\|^2 - E\|X_lF_S\|^2\Big| + M^2\max_{l\le p}E\|X_lF_S\|^2.$$

Hence there is a constant $M_1$ such that, if we define the event (again, keep in mind that $h$ is determined by $\beta$)

$$B_n = \Big\{\sup_{\beta\in\mathcal{B}}\Big\|\frac1n\sum_{i=1}^n X_{il}m(Y_i, X_i^Th)F_i(T\beta)\Big\| < \sqrt{s}M_1,\ \sup_{\beta\in\mathcal{B}}\|J_1(T\beta)\| < M_1\Big\},$$

then $P(B_n)\to 1$.

then P(Bn) → 1. In addition with probability one,

1ni=1ng(Yi,XiTh)Fi(Tβ)supβ1ni=1ng(Yi,XiTβ)FiSsupβ1ni=1n[g(Yi,XiTβ)g(Yi,XiTβ^)]FiS+1ni=1ng(Yi,XiTβ^)FiSZ1+Z2,

where, β^=(β^ST,0)T. For some deterministic sequence rn (to be determined later), we can define the above to be dddd

={β:ββ^<rn/p}

then supβββ̂1 < rn. By the mean value theorem and Cauchy Schwarz inequality, there is β̃:

Z1=supβ1ni=1nm(Yi,XiTβ)FiSXiT(ββ^)ssupβ1ni=1nm(Yi,XiTβ)FiSXiTrnMsmaxkS,lp|1ni=1n(FikXil)2|1/2rn.

Hence there is a constant M2 such that P(Z1 < M2srn) → 1.

Let $\varepsilon_i = g(Y_i, X_i^T\beta_0)$. By the triangle inequality and the mean value theorem, there are $\tilde h$ and $h$ lying on the segment between $\hat\beta$ and $\beta_0$ such that

$$Z_2 \le \Big\|\frac1n\sum_{i=1}^n\varepsilon_iF_{iS}\Big\| + \Big\|\frac1n\sum_{i=1}^n m(Y_i, X_i^Th)F_{iS}X_{iS}^T(\hat\beta_S - \beta_{0S})\Big\| \le \sqrt{s}\max_{j\le p}\Big|\frac1n\sum_{i=1}^n\varepsilon_iF_{ij}\Big| + \Big\|\frac1n\sum_{i=1}^n m(Y_i, X_i^T\beta_0)F_{iS}X_{iS}^T(\hat\beta_S - \beta_{0S})\Big\| + \Big\|\frac1n\sum_{i=1}^n q(Y_i, X_i^T\tilde h)X_{iS}^T(\beta_{0S} - h_S)F_{iS}X_{iS}^T(\hat\beta_S - \beta_{0S})\Big\|$$

$$\le O_p(\sqrt{s\log p/n}) + \big(o_p(1) + \|Em(Y, X^T\beta_0)F_SX_S^T\|\big)\|\hat\beta_S - \beta_{0S}\| + \Big(\frac1n\sum_{i=1}^n q(Y_i, X_i^T\tilde h)^2\|X_{iS}\|^2\Big)^{1/2}\Big(\frac1n\sum_{i=1}^n\|X_{iS}\|^2\|F_{iS}\|^2\Big)^{1/2}\|\hat\beta_S - \beta_{0S}\|^2,$$

where we used the assumption that $\|Em(Y, X^T\beta_0)X_SF_S^T\| = O(1)$. We showed that $\|\nabla L_{FGMM}(\beta_{0S})\| = O_p(\sqrt{(s\log p)/n})$ when verifying the conditions of Theorem B.1. Hence by Theorem B.1, $\|\hat\beta_S - \beta_{0S}\| = O_p(\sqrt{s\log p/n} + \sqrt{s}\,P_n'(d_n))$. Thus

$$Z_2 = O_p\Big(\sqrt{\frac{s\log p}{n}} + \sqrt{s}\,P_n'(d_n) + s^2\sqrt{\frac{s\log s}{n}} + s^2\sqrt{s}\,P_n'(d_n)^2\Big) \equiv O_p(\xi_n).$$

By the assumption $\sqrt{s}\,\xi_n = o(P_n'(0^+))$, we have $P\big(Z_2 < P_n'(0^+)/(8\sqrt{s}M_1^2)\big)\to 1$, where $M_1$ is defined in the event $B_n$. Consequently, if we define the event $D_n = \{Z_1 < M_2\sqrt{s}\,r_n,\ Z_2 < P_n'(0^+)/(8\sqrt{s}M_1^2)\}$, then $P(B_n\cap D_n)\to 1$, and on the event $B_n\cap D_n$,

$$\sup_{\beta\in\mathcal{B}}|a_l(\beta)| \le M_1^2\sqrt{s}\big(M_2\sqrt{s}\,r_n + P_n'(0^+)/(8\sqrt{s}M_1^2)\big) = M_1^2M_2\,s\,r_n + P_n'(0^+)/8.$$

We can choose $r_n < P_n'(0^+)/(8M_1^2M_2\,s)$, so that $\sup_{\beta\in\mathcal{B}}|a_l(\beta)| \le P_n'(0^+)/4$.

On the other hand, because $(T\beta)_j = \beta_j$ for either $j\in S$ or $\beta_j = 0$, there exists $\lambda_2\in(0,1)$ such that

$$\sum_{j=1}^p\big(P_n(|\beta_j|) - P_n(|(T\beta)_j|)\big) = \sum_{j\notin S}P_n(|\beta_j|) = \sum_{l\notin S,\,\beta_l\ne 0}|\beta_l|\,P_n'(\lambda_2|\beta_l|).$$

For all $l\notin S$, $|\beta_l| \le \|\beta - \hat\beta\|_1 < r_n$. Due to the non-increasingness of $P_n'(t)$, $\sum_{l\notin S}P_n(|\beta_l|) \ge \sum_{l\notin S,\,\beta_l\ne 0}|\beta_l|\,P_n'(r_n)$. We can make $r_n$ further smaller so that $P_n'(r_n) \ge P_n'(0^+)/2$, which is satisfied, for example, when $r_n < \lambda_n$ if SCAD($\lambda_n$) is used as the penalty. Hence

$$\sum_{l\notin S}\beta_l\,a_l(\beta) \le \sum_{l\notin S}|\beta_l|\,\frac{P_n'(0^+)}{4} \le \sum_{l\notin S}|\beta_l|\,\frac{P_n'(r_n)}{2} \le \frac12\sum_{l\notin S}P_n(|\beta_l|).$$

Using the same argument we can show $\sum_{l\notin S}\beta_l\,b_l(\beta) \le \frac12\sum_{l\notin S}P_n(|\beta_l|)$. Hence $L_{FGMM}(T\beta) - L_{FGMM}(\beta) < \sum_{l\notin S,\,\beta_l\ne 0}\beta_l(a_l(\beta) + b_l(\beta)) \le \sum_{l\notin S}P_n(|\beta_l|)$ for all $\beta\in\{\beta : \|\beta - \hat\beta\|_1 < r_n\}$ on the event $B_n\cap D_n$. Here $r_n$ is such that $r_n < P_n'(0^+)/(8M_1^2M_2\,s)$ and $P_n'(r_n) \ge P_n'(0^+)/2$. This proves Condition A of Theorem B.2 since $P(B_n\cap D_n)\to 1$.

C.2. Proof of Theorem 4.1: parts (ii) (iii)

We apply Theorem B.2 to conclude that, with probability approaching one, $\hat\beta = (\hat\beta_S^T, 0)^T$ is a local minimizer of $Q_{FGMM}(\beta)$. On this event, $Q_n(\beta)$ has a local minimizer $(\hat\beta_S^T, \hat\beta_N^T)^T$ such that $\hat\beta_N = 0$. This reaches the conclusion of part (ii), and it also implies $P(\hat S\subset S)\to 1$.

By Theorem B.1 and the fact that $\|\nabla L_{FGMM}(\beta_{0S})\| = O_p(\sqrt{(s\log p)/n})$, as proved when verifying the conditions of Theorem B.1, we have $\|\beta_{0S} - \hat\beta_S\| = o_p(d_n)$. So

$$P(S\not\subset\hat S) = P(\exists j\in S,\ \hat\beta_j = 0) \le P\big(\exists j\in S,\ |\beta_{0j} - \hat\beta_j| \ge |\beta_{0j}|\big) \le P\big(\max_{j\in S}|\beta_{0j} - \hat\beta_j| \ge d_n\big) \le P\big(\|\beta_{0S} - \hat\beta_S\| \ge d_n\big) = o(1).$$

This implies $P(S\subset\hat S)\to 1$. Hence $P(\hat S = S)\to 1$.

C.3. Proof of Theorem 4.1: part (i)

Let $P_n'(|\hat\beta_S|) = \big(P_n'(|\hat\beta_{S1}|),\ldots,P_n'(|\hat\beta_{Ss}|)\big)^T$.

Lemma C.2

Under Assumption 4.1,

$$\big\|P_n'(|\hat\beta_S|)\circ\mathrm{sgn}(\hat\beta_S)\big\| = O_p\Big(\max_{\|\beta_S - \beta_{0S}\|\le d_n/4}\eta(\beta)\sqrt{s\log p/n} + \sqrt{s}\,P_n'(d_n)\Big),$$

where ∘ denotes the element-wise product.

Proof. Write $P_n'(|\hat\beta_S|)\circ\mathrm{sgn}(\hat\beta_S) = (v_1,\ldots,v_s)^T$, where $v_i = P_n'(|\hat\beta_{Si}|)\,\mathrm{sgn}(\hat\beta_{Si})$. By the triangle inequality and a Taylor expansion,

$$|v_i| \le \big|P_n'(|\hat\beta_{Si}|) - P_n'(|\beta_{0S,i}|)\big| + P_n'(|\beta_{0S,i}|) \le \eta(\beta^*)\,|\hat\beta_{Si} - \beta_{0S,i}| + P_n'(d_n),$$

where $\beta^*$ lies on the segment joining $\hat\beta_S$ and $\beta_{0S}$. For any $\varepsilon > 0$ and all large $n$,

$$P\Big(\eta(\beta^*) > \max_{\|\beta_S - \beta_{0S}\|\le d_n/4}\eta(\beta)\Big) \le P\big(\|\hat\beta_S - \beta_{0S}\| > d_n/4\big) < \varepsilon.$$

This implies $\eta(\beta^*) = O_p\big(\max_{\|\beta_S - \beta_{0S}\|\le d_n/4}\eta(\beta)\big)$. Therefore, $\|P_n'(|\hat\beta_S|)\circ\mathrm{sgn}(\hat\beta_S)\|^2 = \sum_{i=1}^s v_i^2$ is upper-bounded by

$$2\max_{\|\beta_S - \beta_{0S}\|\le d_n/4}\eta(\beta)^2\,\|\hat\beta_S - \beta_{0S}\|^2 + 2s\,P_n'(d_n)^2,$$

which implies the result since $\|\hat\beta_S - \beta_{0S}\| = O_p(\sqrt{s\log p/n} + \sqrt{s}\,P_n'(d_n))$.

Lemma C.3

Let $\Omega_n = \sqrt{n}\,\Gamma^{-1/2}$. Then for any unit vector $\alpha\in\mathbb{R}^s$,

$$\alpha^T\Omega_n\nabla L_{FGMM}(\beta_{0S})\stackrel{d}{\longrightarrow}N(0,1).$$

Proof. We have $\nabla L_{FGMM}(\beta_{0S}) = 2A_n(\beta_{0S})J(\beta_0)B_n$, where $B_n = \frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_{0S})V_{iS}$. We write $A = Em(Y, X_S^T\beta_{0S})X_SV_S^T$, $\Upsilon = \mathrm{var}(\sqrt{n}B_n) = \mathrm{var}(g(Y, X_S^T\beta_{0S})V_S)$ and $\Gamma = 4AJ(\beta_0)\Upsilon J(\beta_0)^TA^T$.

By the weak law of large numbers and the central limit theorem for i.i.d. data,

$$\|A_n(\beta_{0S}) - A\| = o_p(1), \qquad \sqrt{n}\,\tilde\alpha^T\Upsilon^{-1/2}B_n\stackrel{d}{\longrightarrow}N(0,1)$$

for any unit vector $\tilde\alpha\in\mathbb{R}^{2s}$. Hence by Slutsky's theorem,

$$\sqrt{n}\,\alpha^T\Gamma^{-1/2}\nabla L_{FGMM}(\beta_{0S})\stackrel{d}{\longrightarrow}N(0,1).$$

Proof of Theorem 4.1: part (i)

Proof. The KKT condition for $\hat\beta_S$ gives

$$P_n'(|\hat\beta_S|)\circ\mathrm{sgn}(\hat\beta_S) = -\nabla L_{FGMM}(\hat\beta_S). \quad (C.3)$$

By the mean value theorem, there exists $\beta^*$ lying on the segment joining $\beta_{0S}$ and $\hat\beta_S$ such that

$$\nabla L_{FGMM}(\hat\beta_S) = \nabla L_{FGMM}(\beta_{0S}) + \nabla^2L_{FGMM}(\beta^*)(\hat\beta_S - \beta_{0S}).$$

Let $D = \big(\nabla^2L_{FGMM}(\beta^*) - \nabla^2L_{FGMM}(\beta_{0S})\big)(\hat\beta_S - \beta_{0S})$. It then follows from (C.3) that, for $\Omega_n = \sqrt{n}\,\Gamma_n^{-1/2}$ and any unit vector $\alpha$,

$$\alpha^T\Omega_n\nabla^2L_{FGMM}(\beta_{0S})(\hat\beta_S - \beta_{0S}) = -\alpha^T\Omega_n\big[P_n'(|\hat\beta_S|)\circ\mathrm{sgn}(\hat\beta_S) + \nabla L_{FGMM}(\beta_{0S}) + D\big].$$

When verifying Condition (ii) above, we showed that $\nabla^2L_{FGMM}(\beta_{0S}) = \Sigma(\beta_{0S}) + o_p(1)$. Hence by Lemma C.3, it suffices to show that $\alpha^T\Omega_n\big[P_n'(|\hat\beta_S|)\circ\mathrm{sgn}(\hat\beta_S) + D\big] = o_p(1)$.

By Assumptions 4.5 and 4.6(i), $\lambda_{\min}(\Gamma_n)^{-1/2} = O_p(1)$. Thus $\|\alpha^T\Omega_n\| = O_p(\sqrt{n})$. Lemma C.2 then implies that $\lambda_{\max}(\Omega_n)\,\big\|P_n'(|\hat\beta_S|)\circ\mathrm{sgn}(\hat\beta_S)\big\|$ is bounded by

$$O_p(\sqrt{n})\Big(\max_{\|\beta_S - \beta_{0S}\|\le d_n/4}\eta(\beta)\sqrt{s\log p/n} + \sqrt{s}\,P_n'(d_n)\Big) = o_p(1).$$

It remains to prove $\|D\| = o_p(n^{-1/2})$, for which it suffices to show that

$$\big\|\nabla^2L_{FGMM}(\beta^*) - \nabla^2L_{FGMM}(\beta_{0S})\big\| = o_p\big((s\log p)^{-1/2}\big) \quad (C.4)$$

due to $\|\hat\beta_S - \beta_{0S}\| = O_p(\sqrt{s\log p/n} + \sqrt{s}\,P_n'(d_n))$ and the condition in Assumption 4.6 that $\sqrt{ns}\,P_n'(d_n) = o(1)$. Showing (C.4) is straightforward given the continuity of $\nabla^2L_{FGMM}$.

Appendix D: Proofs for Sections 5 and 6

The local minimizer in Theorem 4.1 is denoted by $\hat\beta = (\hat\beta_S^T, \hat\beta_N^T)^T$, and $P(\hat\beta_N = 0)\to 1$. Let $\hat\beta_G = (\hat\beta_S^T, 0)^T$.

D.1. Proof of Theorem 5.1

Lemma D.1

$L_{FGMM}(\hat\beta_G) = O_p\big(s\log p/n + s\,P_n'(d_n)^2\big)$.

Proof. We have $L_{FGMM}(\hat\beta_G) \le \big\|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\hat\beta_S)V_{iS}\big\|^2O_p(1)$. By a Taylor expansion, with some $\tilde\beta_S$ on the segment joining $\beta_{0S}$ and $\hat\beta_S$,

$$\Big\|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\hat\beta_S)V_{iS}\Big\| \le \Big\|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_{0S})V_{iS}\Big\| + \Big\|\frac1n\sum_{i=1}^n m(Y_i, X_{iS}^T\tilde\beta_S)X_{iS}V_{iS}^T\Big\|\,\|\hat\beta_S - \beta_{0S}\| \le O_p(\sqrt{s\log p/n}) + \Big\|\frac1n\sum_{i=1}^n m(Y_i, X_{iS}^T\beta_{0S})X_{iS}V_{iS}^T\Big\|\,\|\hat\beta_S - \beta_{0S}\| + \frac1n\sum_{i=1}^n\big|m(Y_i, X_{iS}^T\tilde\beta_S) - m(Y_i, X_{iS}^T\beta_{0S})\big|\,\|X_{iS}V_{iS}^T\|\,\|\hat\beta_S - \beta_{0S}\|.$$

Note that $\|Em(Y, X_S^T\beta_{0S})X_SV_S^T\|$ is bounded due to Assumption 4.5. Applying a Taylor expansion again, with some $\tilde\beta_S^*$, the above is bounded by

$$O_p(\sqrt{s\log p/n}) + O_p(1)\,\|\hat\beta_S - \beta_{0S}\| + \frac1n\sum_{i=1}^n\big|q(Y_i, X_{iS}^T\tilde\beta_S^*)\big|\,\|X_{iS}\|\,\|\tilde\beta_S - \beta_{0S}\|\,\|X_{iS}V_{iS}^T\|\,\|\hat\beta_S - \beta_{0S}\|.$$

Note that $\sup_{t_1,t_2}|q(t_1,t_2)| < \infty$ by Assumption 4.4. The last term above is bounded by $C\frac1n\sum_{i=1}^n\|X_{iS}\|\,\|X_{iS}V_{iS}^T\|\,\|\hat\beta_S - \beta_{0S}\|^2$. Combining these bounds, $\big\|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\hat\beta_S)V_{iS}\big\|$ is bounded by $O_p(\sqrt{s\log p/n} + \sqrt{s}\,P_n'(d_n)) + O_p(s\sqrt{s})\,\|\hat\beta_S - \beta_{0S}\|^2 = O_p(\sqrt{s\log p/n} + \sqrt{s}\,P_n'(d_n))$.

Lemma D.2

Under the theorem's assumptions,

$$Q_{FGMM}(\hat\beta_G) = O_p\Big(\frac{s\log p}{n} + s\,P_n'(d_n)^2 + s\max_{j\in S}P_n(|\beta_{0j}|) + P_n'(d_n)\sqrt{\frac{s\log s}{n}}\Big).$$

Proof. By the foregoing lemma, we have

$$Q_{FGMM}(\hat\beta_G) = O_p\Big(\frac{s\log p}{n} + s\,P_n'(d_n)^2\Big) + \sum_{j=1}^sP_n(|\hat\beta_{Sj}|).$$

Now, for some $\tilde\beta_{Sj}$ on the segment joining $\hat\beta_{Sj}$ and $\beta_{0j}$,

$$\sum_{j=1}^sP_n(|\hat\beta_{Sj}|) \le \sum_{j=1}^sP_n(|\beta_{0S,j}|) + \sum_{j=1}^sP_n'(|\tilde\beta_{Sj}|)\,|\hat\beta_{Sj} - \beta_{0S,j}| \le s\max_{j\in S}P_n(|\beta_{0j}|) + \sum_{j=1}^sP_n'(d_n)\,|\hat\beta_{Sj} - \beta_{0S,j}| \le s\max_{j\in S}P_n(|\beta_{0j}|) + P_n'(d_n)\,\|\hat\beta_S - \beta_{0S}\|\sqrt{s}.$$

The result then follows.

Note that for all $\delta > 0$,

$$\inf_{\beta\in\Theta_\delta\setminus\{0\}}Q_{FGMM}(\beta) \ge \inf_{\beta\in\Theta_\delta\setminus\{0\}}L_{FGMM}(\beta) \ge \inf_{\beta\in\Theta_\delta\setminus\{0\}}\Big\|\frac1n\sum_{i=1}^n g(Y_i, X_i^T\beta)V_i(\beta)\Big\|^2\min_{j\le p}\big\{\widehat{\mathrm{var}}(X_j),\widehat{\mathrm{var}}(X_j^2)\big\}.$$

Hence by Assumption 5.1, there exists $\gamma > 0$ such that

$$P\Big(\inf_{\beta\in\Theta_\delta\setminus\{0\}}Q_{FGMM}(\beta) > 2\gamma\Big)\to 1.$$

On the other hand, by Lemma D.2, $Q_{FGMM}(\hat\beta_G) = o_p(1)$. Therefore,

$$P\Big(Q_{FGMM}(\hat\beta) + \gamma > \inf_{\beta\in\Theta_\delta\setminus\{0\}}Q_{FGMM}(\beta)\Big) \le P\Big(Q_{FGMM}(\hat\beta_G) + \gamma > \inf_{\beta\in\Theta_\delta\setminus\{0\}}Q_{FGMM}(\beta)\Big) + o(1) \le P\big(Q_{FGMM}(\hat\beta_G) + \gamma > 2\gamma\big) + P\Big(\inf_{\beta\in\Theta_\delta\setminus\{0\}}Q_{FGMM}(\beta) < 2\gamma\Big) + o(1) \le P\big(Q_{FGMM}(\hat\beta_G) > \gamma\big) + o(1) = o(1).$$

Q.E.D.

D.2. Proof of Theorem 6.1

Lemma D.3

Define $\rho(\beta_S) = E\big[g(Y, X_S^T\beta_S)\sigma(W)^{-2}D(W)\big]$. Under the theorem's assumptions, $\sup_{\beta_S\in\Theta}\|\rho(\beta_S) - \rho_n(\beta_S)\| = o_p(1)$.

Proof. We first show three convergence results:

$$\sup_{\beta_S\in\Theta}\Big\|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_S)\big(D(W_i) - \hat D(W_i)\big)\hat\sigma(W_i)^{-2}\Big\| = o_p(1), \quad (D.1)$$
$$\sup_{\beta_S\in\Theta}\Big\|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_S)D(W_i)\big(\hat\sigma(W_i)^{-2} - \sigma(W_i)^{-2}\big)\Big\| = o_p(1), \quad (D.2)$$
$$\sup_{\beta_S\in\Theta}\Big\|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_S)D(W_i)\sigma(W_i)^{-2} - Eg(Y, X_S^T\beta_S)D(W)\sigma(W)^{-2}\Big\| = o_p(1). \quad (D.3)$$

Because both $\sup_w\|\hat D(w) - D(w)\|$ and $\sup_w|\hat\sigma(w)^2 - \sigma(w)^2|$ are $o_p(1)$, proving (D.1) and (D.2) is straightforward. In addition, given the assumption that $E\big(\sup_{\beta_S\in\Theta}\|g(Y, X_S^T\beta_S)\|^4\big) < \infty$, (D.3) follows from the uniform law of large numbers. Hence we have

$$\sup_{\beta_S\in\Theta}\Big\|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_S)\hat D(W_i)\hat\sigma(W_i)^{-2} - Eg(Y, X_S^T\beta_S)D(W)\sigma(W)^{-2}\Big\| = o_p(1).$$

In addition, the event $\{\hat X_{iS} = X_{iS},\ i = 1,\ldots,n\}$ occurs with probability approaching one, given the selection consistency $P(\hat S = S)\to 1$ achieved in Theorem 4.1.

The result then follows because $\rho_n(\beta_S) = \frac1n\sum_{i=1}^n g(Y_i, \hat X_{iS}^T\beta_S)\hat\sigma(W_i)^{-2}\hat D(W_i)$.

Given Lemma D.3, Theorem 6.1 follows from a standard argument for the asymptotic normality of GMM estimators, as in Hansen (1982) and Newey and McFadden (1994, Theorem 3.4). The asymptotic variance achieves the semi-parametric efficiency bound derived by Chamberlain (1987) and Severini and Tripathi (2001). Therefore, β̂* is semi-parametrically efficient.

Appendix E: Proofs for Section 7

The proof of Theorem 7.1 is very similar to that of Theorem 4.1, which we leave to the online supplementary material, downloadable from http://terpconnect.umd.edu/∼yuanliao/high/supp.pdf

Proof of Theorem 7.2

Proof. Define $Q_{l,k} = L_K(\beta_{(k)}^{(l)}, \beta_k^{(l)}) + \sum_{j\le k}P_n(|\beta_j^{(l)}|) + \sum_{j>k}P_n(|\beta_j^{(l-1)}|)$. We first show that $Q_{l,k}\le Q_{l,k-1}$ for $1 < k\le p$ and that $Q_{l+1,1}\le Q_{l,p}$. For $1 < k\le p$, $Q_{l,k} - Q_{l,k-1}$ equals

$$L_K(\beta_{(k)}^{(l)}, \beta_k^{(l)}) + P_n(|\beta_k^{(l)}|) - \big[L_K(\beta_{(k-1)}^{(l)}, \beta_{k-1}^{(l)}) + P_n(|\beta_k^{(l-1)}|)\big].$$

Note that the difference between $(\beta_{(k)}^{(l)}, \beta_k^{(l)})$ and $(\beta_{(k-1)}^{(l)}, \beta_{k-1}^{(l)})$ lies only in the $k$th position: the $k$th position of $(\beta_{(k)}^{(l)}, \beta_k^{(l)})$ is $\beta_k^{(l)}$, while that of $(\beta_{(k-1)}^{(l)}, \beta_{k-1}^{(l)})$ is $\beta_k^{(l-1)}$. Hence, by the updating criterion, $Q_{l,k}\le Q_{l,k-1}$ for $k\le p$.

Because $(\beta_{(1)}^{(l+1)}, \beta_1^{(l+1)})$ is the first update in the $(l+1)$th iteration, $(\beta_{(1)}^{(l+1)}, \beta_1^{(l+1)}) = (\beta_{(1)}^{(l)}, \beta_1^{(l+1)})$. Hence

$$Q_{l+1,1} = L_K(\beta_{(1)}^{(l)}, \beta_1^{(l+1)}) + P_n(|\beta_1^{(l+1)}|) + \sum_{j>1}P_n(|\beta_j^{(l)}|).$$

On the other hand, since $\beta^{(l)} = (\beta_{(p)}^{(l)}, \beta_p^{(l)})$,

$$Q_{l,p} = L_K(\beta^{(l)}) + \sum_{j>1}P_n(|\beta_j^{(l)}|) + P_n(|\beta_1^{(l)}|).$$

Hence $Q_{l+1,1} - Q_{l,p} = L_K(\beta_{(1)}^{(l)}, \beta_1^{(l+1)}) + P_n(|\beta_1^{(l+1)}|) - \big[L_K(\beta^{(l)}) + P_n(|\beta_1^{(l)}|)\big]$. Note that $(\beta_{(1)}^{(l)}, \beta_1^{(l+1)})$ differs from $\beta^{(l)}$ only in the first position. By the updating criterion, $Q_{l+1,1} - Q_{l,p}\le 0$.

Therefore, if we define $\{L_m\}_{m\ge1} = \{Q_{1,1},\ldots,Q_{1,p},Q_{2,1},\ldots,Q_{2,p},\ldots\}$, then we have shown that $\{L_m\}_{m\ge1}$ is a non-increasing sequence. In addition, $L_m\ge 0$ for all $m\ge1$. Hence $\{L_m\}$ is a bounded convergent sequence, which also implies that it is Cauchy. By the definition of $Q_K(\beta^{(l)})$, we have $Q_K(\beta^{(l)}) = Q_{l,p}$, and thus $\{Q_K(\beta^{(l)})\}_{l\ge1}$ is a subsequence of $\{L_m\}$. Hence it is also bounded and Cauchy. Therefore, for any $\varepsilon > 0$ there is $N > 0$ such that for $l_1, l_2 > N$, $|Q_K(\beta^{(l_1)}) - Q_K(\beta^{(l_2)})| < \varepsilon$, which implies that the iterations stop after finitely many steps.

The rest of the proof is similar to that of the Lyapunov theorem in Lange (1995, Prop. 4). Consider a limit point $\beta^*$ of $\{\beta^{(l)}\}_{l\ge1}$, so that there is a subsequence with $\lim_{k\to\infty}\beta^{(l_k)} = \beta^*$. Because both $Q_K(\cdot)$ and $M(\cdot)$ are continuous and $Q_K(\beta^{(l)})$ is a Cauchy sequence, taking limits yields

$$Q_K(M(\beta^*)) = \lim_{k\to\infty}Q_K(M(\beta^{(l_k)})) = \lim_{k\to\infty}Q_K(\beta^{(l_k)}) = Q_K(\beta^*).$$

Hence $\beta^*$ is a stationary point of $Q_K(\beta)$.
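The monotonicity argument above can be mirrored in a few lines of code. The sketch below is our illustration, not the authors' implementation: it performs coordinate-wise updating of the kind analyzed in Theorem 7.2 on a generic smooth loss plus penalty, minimizing over one coordinate at a time by a simple grid search. The loss, penalty, and grid are placeholders chosen only to show that the penalized objective is non-increasing along the iterations and that the updates stop after finitely many sweeps.

```python
import numpy as np

def coordinate_descent(loss, penalty, beta0, grid, max_iter=50, tol=1e-8):
    """Coordinate-wise minimization of loss(beta) + sum_j penalty(|beta_j|).

    Each coordinate is updated by minimizing the objective over a finite grid
    while the other coordinates are held fixed, so the objective is
    non-increasing at every update (the property used in Theorem 7.2).
    """
    beta = beta0.copy()
    objective = lambda b: loss(b) + sum(penalty(abs(bj)) for bj in b)
    prev = objective(beta)
    for _ in range(max_iter):
        for k in range(len(beta)):
            candidates = []
            for val in grid:
                trial = beta.copy()
                trial[k] = val
                candidates.append((objective(trial), val))
            beta[k] = min(candidates)[1]       # best value for coordinate k
        curr = objective(beta)
        if prev - curr < tol:                  # stop when the decrease is negligible
            break
        prev = curr
    return beta

# toy usage: least-squares loss with a folded-concave-looking placeholder penalty
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
beta_true = np.array([1.5, 0.0, -1.0, 0.0, 0.0])
Y = X @ beta_true + 0.5 * rng.normal(size=200)

loss = lambda b: np.mean((Y - X @ b) ** 2)
penalty = lambda t: 0.1 * min(t, 0.3)          # placeholder; not the paper's penalty
grid = np.linspace(-2.0, 2.0, 81)

print(coordinate_descent(loss, penalty, np.zeros(5), grid).round(2))
```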

Footnotes

1

In fixed designs, e.g., Zhao and Yu (2006), it has been implicitly assumed that $n^{-1}\sum_{i=1}^n\varepsilon_iX_{ij} = o_p(1)$ for all $j\le p$.

2

We thank the AE and referees for suggesting the use of a general vector of instruments W, which extends the framework to a more general endogeneity problem and allows the presence of endogenous important regressors. In particular, W is allowed to be XS, which amounts to assuming that E(ε|XS) = 0 by (1.4) while allowing E(ε|X) ≠ 0. In this case, we can allow the instruments W = XS to be unknown, and F and H, defined below, can be transformations of X. This is the setup of an earlier version of this paper; it is much weaker than (1.2) and allows some components of XN to be endogenous.

3

The compatibility of (1.6) requires very stringent conditions. If $EF_SX_S^T$ and $EH_SX_S^T$ are invertible, then a necessary condition for (1.6) to have a common solution is that $(EF_SX_S^T)^{-1}E(YF_S) = (EH_SX_S^T)^{-1}E(YH_S)$, which does not hold in general when $F\ne H$.

4

For technical reasons we use a diagonal weight matrix and it is likely non-optimal. However, it does not affect the variable selection consistency in this step.

5

We thank a referee for reminding us this important research direction.

Contributor Information

Jianqing Fan, Email: jqfan@princeton.edu, Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544.

Yuan Liao, Email: yuanliao@umd.edu, Department of Mathematics, University of Maryland, College Park, MD 20742.

References

  1. Ai C, Chen X. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica. 2003;71:1795–1843. [Google Scholar]
  2. Andrews D. Consistent moment selection procedures for generalized method of moments estimation. Econometrica. 1999;67:543–564. [Google Scholar]
  3. Andrews D, Lu B. Consistent model and moment selection procedures for GMM estimation with application to dynamic panel data models. J Econometrics. 2001;101:123–164. [Google Scholar]
  4. Antoniadis A. Smoothing noisy data with tapered coiflets series. Scand J Stat. 1996;23:313–330. [Google Scholar]
  5. Belloni A, Chen D, Chernozhukov V, Hansen C. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica. 2012;80:2369–2429. [Google Scholar]
  6. Belloni A, Chernozhukov V. Least squares after model selection in high-dimensional sparse models. Bernoulli. 2013;19:521–547. [Google Scholar]
  7. Belloni A, Chernozhukov V, Hansen C. Inference on treatment effects after selection amongst high-dimensional controls. Review of Economic Studies 2013 Forthcoming. [Google Scholar]
  8. Bickel P, Klaassen C, Ritov Y, Wellner J. Efficient and adaptive estimation for semiparametric models. Springer; New York: 1998. [Google Scholar]
  9. Bickel P, Ritov Y, Tsybakov A. Simultaneous analysis of Lasso and Dantzig selector. Ann Statist. 2009;37:1705–1732. [Google Scholar]
  10. Bondell H, Reich B. Consistent high-dimensional Bayesian variable selection via penalized credible regions. J Amer Statist Assoc. 2012;107:1610–1624. doi: 10.1080/01621459.2012.716344. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Bradic J, Fan J, Wang W. Penalized composite quasi-likelihood for ultrahigh-dimensional variable selection. J R Stat Soc Ser B. 2011;73:325–349. doi: 10.1111/j.1467-9868.2010.00764.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Breheny P, Huang J. Coordinate descent algorithms for non convex penalized regression, with applications to biological feature selection. Ann Appl Statist. 2011;5:232–253. doi: 10.1214/10-AOAS388. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Bühlmann P, Kalisch M, Maathuis M. Variable selection in high-dimensional models: partially faithful distributions and the PC-simple algorithm. Biometrika. 2010;97:261–278. [Google Scholar]
  14. Bühlmann P, van de Geer S. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer; New York: 2011. [Google Scholar]
  15. Canay I, Santos A, Shaikh A. On the testability of identification in some nonparametric models with endogeneity. Econometrica. 2013;81:2535–2559. [Google Scholar]
  16. Caner M. Lasso-type GMM estimator. Econometric Theory. 2009;25:270–290. [Google Scholar]
  17. Caner M, Fan Q. Hybrid generalized empirical likelihood estimators: instrument selection with adaptive lasso. Manuscript 2012 [Google Scholar]
  18. Caner M, Zhang H. Adaptive elastic net GMM with diverging number of moments. Journal of Business and Economic Statistics. 2013 doi: 10.1080/07350015.2013.836104. forthcoming. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Candès E, Tao T. The Dantzig selector: statistical estimation when p is much larger than n. Ann Statist. 2007;35:2313–2404. [Google Scholar]
  20. Chamberlain G. Asymptotic efficiency in estimation with conditional moment restrictions. J Econometrics. 1987;34:305–334. [Google Scholar]
  21. Chen X. Large sample sieve estimation of semi-nonparametric models. In: Heckman JJ, Leamer EE, editors. Handbook of Econometrics. VI ch 76 2007. [Google Scholar]
  22. Chen X, Pouzo D. Estimation of nonparametric conditional moment models with possibly nonsmooth generalized residuals. Econometrica. 2012;80:277–321. [Google Scholar]
  23. Chernozhukov V, Hong H. An MCMC approach to classical estimation. J Econometrics. 2003;115:293–346. [Google Scholar]
  24. Daubechies I, Defrise M, De Mol C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm Pure Appl Math. 2004;57:1413–1457. [Google Scholar]
  25. Dominguez M, Lobato I. Consistent estimation of models defined by conditional moment restrictions. Econometrica. 2004;72:1601–1615. [Google Scholar]
  26. Donald S, Imbens G, Newey W. Choosing instrumental variables in conditional moment restriction models. J Econometrics. 2009;153:28–36. [Google Scholar]
  27. Engle R, Hendry D, Richard J. Exogeneity. Econometrica. 1983;51:277–304. [Google Scholar]
  28. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–1360. [Google Scholar]
  29. Fan J, Liao Y. Endogeneity in ultra high dimensions. Manuscript 2012 [Google Scholar]
  30. Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B. 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Fan J, Lv J. Non-concave penalized likelihood with NP-dimensionality. IEEE Trans Inform Theory. 2011;57:5467–5484. doi: 10.1109/TIT.2011.2158486. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Fan J, Yao Q. Efficient estimation of conditional variance functions in stochastic regression. Biometrika. 1998;85:645–660. [Google Scholar]
  33. Fu W. Penalized regression: The bridge versus the LASSO. J Comput Graph Statist. 1998;7:397–416. [Google Scholar]
  34. García E. Linear regression with a large number of weak instruments using a post-l1-penalized estimator. Manuscript 2011 [Google Scholar]
  35. Gautier E, Tsybakov A. High dimensional instrumental variables regression and confidence sets. Manuscript 2011 [Google Scholar]
  36. van de Geer S. High-dimensional generalized linear models and the lasso. Ann Statist. 2008;36:614–645. [Google Scholar]
  37. Hall P, Horowitz J. Nonparametric methods for inference in the presence of instrumental variables. Ann Statist. 2005;33:2904–2929. [Google Scholar]
  38. Hansen L. Large sample properties of generalized method of moments estimators. Econometrica. 1982;50:1029–1054. [Google Scholar]
  39. Horowitz J. A smoothed maximum score estimator for the binary response model. Econometrica. 1992;60:505–531. [Google Scholar]
  40. Huang J, Horowitz J, Ma S. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann Statist. 2008;36:587–613. [Google Scholar]
  41. Huang J, Ma S, Zhang C. Adaptive lasso for sparse high-dimensional regression models. Statistica Sinica. 2008;18:1603–1618. [Google Scholar]
  42. Hunter D, Li R. Variable selection using MM algorithms. Ann Statist. 2005;33:1617–1642. doi: 10.1214/009053605000000200. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Kim Y, Choi H, Oh H. Smoothly Clipped Absolute Deviation on High Dimensions. J Amer Statist Assoc. 2008;103:1665–1673. [Google Scholar]
  44. Lange K. A gradient algorithm locally equivalent to the EM algorithm. J Roy Statist Soc Ser B. 1995;57:425–437. [Google Scholar]
  45. Kitamura Y, Tripathi G, Ahn H. Empirical likelihood-based inference in conditional moment restriction models. Econometrica. 2004;72:1667–1714. [Google Scholar]
  46. Leeb H, Pötscher B. Sparse estimators and the oracle property, or the return of Hodges' estimator. J Econometrics. 2008;142:201–211. [Google Scholar]
  47. Liao Z. Adaptive GMM shrinkage estimation with consistent moment selection. Econometric Theory. 2013;29:857–904. [Google Scholar]
  48. Loh P, Wainwright M. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Manuscript 2013 [Google Scholar]
  49. Lv J, Fan Y. A unified approach to model selection and sparse recovery using regularized least squares. Ann Statist. 2009;37:3498–3528. [Google Scholar]
  50. Newey W. Semiparametric efficiency bound. J Appl Econometrics. 1990;5:99–125. [Google Scholar]
  51. Newey W. Efficient estimation of models with conditional moment restrictions. In: Maddala GS, Rao CR, Vinod HD, editors. Handbook of Statistics, Volume 11: Econometrics. Amsterdam; North-Holland: 1993. [Google Scholar]
  52. Newey W, McFadden D. Large sample estimation and hypothesis testing. In: Engle R, McFadden D, editors. Handbook of Econometrics. Chapter 36. 1994. [Google Scholar]
  53. Newey W, Powell J. Instrumental variable estimation of nonpara-metric models. Econometrica. 2003;71:1565–1578. [Google Scholar]
  54. Städler N, Bühlmann P, van de Geer S. l1-penalization for mixture regression models with discussion. Test. 2010;19:209–256. [Google Scholar]
  55. Severini T, Tripathi G. A simplified approach to computing efficiency bounds in semiparametric models. J Econometrics. 2001;102:23–66. [Google Scholar]
  56. Tibshirani R. Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B. 1996;58:267–288. [Google Scholar]
  57. Wasserman L, Roeder K. High-dimensional variable selection. Ann Statist. 2009;37:2178–2201. doi: 10.1214/08-aos646. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Zhang C. Nearly unbiased variable selection under minimax concave penalty. Ann Statist. 2010;38:894–942. [Google Scholar]
  59. Zhang C, Huang J. The sparsity and bias of the Lasso selection in high-dimensional linear models. Ann Statist. 2008;36:1567–1594. [Google Scholar]
  60. Zhang C, Zhang T. A general theory of concave regularization for high dimensional sparse estimation problems. Statistical Science. 2012;27:576–593. [Google Scholar]
  61. Zhao P, Yu B. On model selection consistency of Lasso. J Mach Learn Res. 2006;7:2541–2563. [Google Scholar]
  62. Zou H. The adaptive Lasso and its oracle properties. J Amer Statist Assoc. 2006;101:1418–1429. [Google Scholar]
  63. Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models. Ann Statist. 2008;36:1509–1533. doi: 10.1214/009053607000000802. [DOI] [PMC free article] [PubMed] [Google Scholar]
  64. Zou H, Zhang H. On the adaptive elastic-net with a diverging number of parameters. Ann Statist. 2009;37:1733–1751. doi: 10.1214/08-AOS625. [DOI] [PMC free article] [PubMed] [Google Scholar]
