Author manuscript; available in PMC: 2015 Jan 8.
Published in final edited form as: Ann Stat. 2014 May 20;42(3):872–917. doi: 10.1214/13-AOS1202

Endogeneity in High Dimensions

Jianqing Fan 1, Yuan Liao 2
PMCID: PMC4286899  NIHMSID: NIHMS649193  PMID: 25580040

Abstract

Most papers on high-dimensional statistics are based on the assumption that none of the regressors are correlated with the regression error, namely, they are exogenous. Yet, endogeneity can arise incidentally from a large pool of regressors in a high-dimensional regression. This causes the inconsistency of the penalized least-squares method and possible false scientific discoveries. A necessary condition for model selection consistency of a general class of penalized regression methods is given, which allows us to prove formally the inconsistency claim. To cope with the incidental endogeneity, we construct a novel penalized focused generalized method of moments (FGMM) criterion function. The FGMM effectively achieves dimension reduction and applies the instrumental variable method. We show that it possesses the oracle property even in the presence of endogenous predictors, and that the solution is also a near-global minimizer under the over-identification assumption. Finally, we also show how the semi-parametric efficiency of estimation can be achieved via a two-step approach.

Keywords: Focused GMM, Sparsity recovery, Endogenous variables, Oracle property, Conditional moment restriction, Estimating equation, Over identification, Global minimization, Semi-parametric efficiency

1. Introduction

In high-dimensional models, the overall number of regressors p grows extremely fast with the sample size n; it can be of order exp(n^α) for some α ∈ (0, 1). What make statistical inference possible are the sparsity and exogeneity assumptions. For example, in the linear model

Y = X^T β_0 + ε, (1.1)

it is assumed that the number of elements in S = {j : β_0j ≠ 0} is small and E(εX) = 0, or more stringently

E(ε | X) = E(Y − X^T β_0 | X) = 0. (1.2)

The latter is called “exogeneity”. One of the important objectives of high-dimensional modeling is to achieve the variable selection consistency and make inference on the coefficients of important regressors. See, for example, Fan and Li (2001), Hunter and Li (2005), Zou (2006), Zhao and Yu (2006), Huang, Horowitz and Ma (2008), Zhang and Huang (2008), Wasserman and Roeder (2009), Lv and Fan (2009), Zou and Zhang (2009), Städler, Bühlmann and van de Geer (2010), and Bühlmann, Kalisch and Maathuis (2010). In these papers, (1.2) (or E(εX) = 0) has been assumed either explicitly or implicitly.¹ Conditions of this kind are also required by the Dantzig selector of Candès and Tao (2007), which solves an optimization problem under the constraint max_{j≤p} | n^{-1} Σ_{i=1}^n X_ij (Y_i − X_i^T β) | < C √((log p)/n) for some C > 0.

In high-dimensional models, requesting that ε and all the components of X be uncorrelated as (1.2), or even more specifically

E[(Y − X^T β_0) X_j] = 0, j = 1, …, p, (1.3)

can be restrictive, particularly when p is large. Yet, (1.3) is a necessary condition for popular model selection techniques to be consistent. However, violations of either assumption (1.2) or (1.3) can arise as a result of selection biases, measurement errors, autoregression with autocorrelated errors, omitted variables, and from many other sources (Engle, Hendry and Richard 1983). They also arise from unknown causes due to a large pool of regressors, some of which are incidentally correlated with the random noise Y − X^T β_0. For example, in genomics studies, clinical or biological outcomes along with expressions of tens of thousands of genes are frequently collected. After applying variable selection techniques, scientists obtain a set of genes Ŝ that are responsible for the outcome. Whether (1.3) holds, however, is rarely validated. Because there are tens of thousands of restrictions in (1.3) to validate, it is likely that some of them are violated. Indeed, unlike in low-dimensional least squares, the sample correlations between the residuals ∊̂, based on the selected variables X_Ŝ, and the predictors X are unlikely to all be small, because the variables in the large set Ŝ^c are not even used in computing the residuals. When some of those correlations are unusually large, endogeneity arises incidentally. In such cases, we will show that Ŝ can be inconsistent. In other words, violation of assumption (1.3) can lead to false scientific claims.

We aim to consistently estimate β_0 and recover its sparsity under weaker conditions than (1.2) or (1.3) that are easier to validate. Let us assume that β_0 = (β_{0S}^T, 0)^T and that X can be partitioned as X = (X_S^T, X_N^T)^T. Here X_S corresponds to the nonzero coefficients β_{0S}, which we call important regressors, and X_N represents the unimportant regressors throughout the paper, whose coefficients are zero. We borrow the terminology of endogeneity from the econometric literature. A regressor is said to be endogenous when it is correlated with the error term, and exogenous otherwise. Motivated by the aforementioned issue, this paper aims to select X_S with probability approaching one and to make inference about β_{0S}, allowing components of X to be endogenous. We propose a unified procedure that can address the problem of endogeneity being present in either the important or the unimportant regressors, or both, and we do not require knowledge of which case of endogeneity is present in the true model. The identities of X_S are unknown before the selection.

The main assumption we make is that there is a vector of observable instrumental variables W such that

E[ε|W]=0. (1.4)


Briefly speaking, W is called an “instrumental variable” when it satisfies (1.4) and is correlated with the explanatory variable X. In particular, as noted in the footnote, W = X_S is allowed, so that the instruments are unknown but no additional data are needed. Instrumental variables (IV) have been commonly used in both the econometric and statistical literatures in the presence of endogenous regressors, to achieve identification and consistent estimation (e.g., Hall and Horowitz 2005). An advantage of such an assumption is that it can be validated more easily. For example, when W = X_S, one needs only to check whether the correlations between ∊̂ and X_Ŝ are small, with X_Ŝ being a relatively low-dimensional vector, or, more generally, whether the moments that are actually used in the model fitting, such as (1.5) below, hold approximately. In short, our setup weakens the assumption (1.2) to some verifiable moment conditions.

What makes the variable selection consistency (with endogeneity) possible is the idea of over identification. Briefly speaking, a parameter is called “over-identified” if there are more restrictions than are needed to guarantee its identifiability (for linear models, for instance, when the parameter satisfies more equations than its dimension). Let (f_1, …, f_p) and (h_1, …, h_p) be two different sets of transformations, which can be taken as a large number of series terms, e.g., B-splines and polynomials. Here each f_j and h_j is a scalar function. Then (1.4) implies

E(ε f_j(W)) = E(ε h_j(W)) = 0, j = 1, …, p.

Write F = (f_1(W), …, f_p(W))^T and H = (h_1(W), …, h_p(W))^T. We then have E(εF) = E(εH) = 0. Let S be the set of indices of important variables, and let F_S and H_S be the subvectors of F and H corresponding to the indices in S. Implied by E(εF) = E(εH) = 0 and ε = Y − X_S^T β_{0S}, there exists a solution β_S = β_{0S} to the over-identified equations (with respect to β_S):

E[(Y − X_S^T β_S) F_S] = 0 and E[(Y − X_S^T β_S) H_S] = 0. (1.5)

In (1.5), we have twice as many linear equations as unknowns, yet a solution exists and is given by β_S = β_{0S}. Because β_{0S} satisfies more equations than its dimension, we say that β_{0S} is over-identified. On the other hand, for any other set S̃ of variables with S ⊄ S̃, the following 2|S̃| equations (with |S̃| = dim(β_{S̃}) unknowns)

E[(Y − X_{S̃}^T β_{S̃}) F_{S̃}] = 0 and E[(Y − X_{S̃}^T β_{S̃}) H_{S̃}] = 0 (1.6)

have no solution as long as the basis functions are chosen such that F ≠ H.³ The above setup includes W = X_S with F = X and H = X² as a specific example (or H = cos(X) + 1 if X contains many binary variables).

We show that in the presence of endogenous regressors, the classical penalized least squares method is no longer consistent. Under model

Y = X_S^T β_{0S} + ε,  E(ε | W) = 0,

we introduce a novel penalized method, called the focused generalized method of moments (FGMM), which differs from the classical GMM (Hansen 1982) in that the working instrument V(β) in the moment functions n^{-1} Σ_{i=1}^n (Y_i − X_i^T β) V(β) for the FGMM depends irregularly on the unknown parameter β (it also depends on (F, H); see Section 3 for details). With the help of over identification, the FGMM successfully eliminates those subsets S̃ such that S ⊄ S̃. As we will see in Section 3, a penalization is still needed to avoid over-fitting. This results in a novel penalized FGMM.

We would like to comment that the FGMM differs from the low-dimensional techniques of either moment selection (Andrews 1999, Andrews and Lu 2001) or shrinkage GMM (Liao 2013) in dealing with mis-specification of moment conditions and dimension reduction. The existing methods in the literature on GMM moment selection cannot handle high-dimensional models. Recent literature on the instrumental variable method for high-dimensional models includes, e.g., Belloni et al. (2012), Caner and Fan (2012), and García (2011). In these papers, the endogenous variables are low-dimensional. More closely related work is by Gautier and Tsybakov (2011), who solved a constrained minimization as an extension of the Dantzig selector. Our paper, in contrast, achieves the oracle property via a penalized GMM. Also, we study a more general conditional moment restricted model that allows nonlinear models.

The remainder of this paper is organized as follows: Section 2 gives a necessary condition for a general penalized regression to achieve the oracle property. We also show that in the presence of endogenous regressors, the penalized least squares method is inconsistent. Section 3 constructs a penalized FGMM and discusses the rationale of our construction. Section 4 shows the oracle property of the FGMM. Section 5 discusses the global optimization. Section 6 focuses on the semi-parametric efficient estimation after variable selection. Section 7 discusses numerical implementations. We present simulation results in Section 8. Finally, Section 9 concludes. Proofs are given in the appendix.

Notation

Throughout the paper, let λ_min(A) and λ_max(A) be the smallest and largest eigenvalues of a square matrix A. We denote by ‖A‖_F, ‖A‖ and ‖A‖_∞ the Frobenius, operator and element-wise norms of a matrix A, defined respectively as ‖A‖_F = tr^{1/2}(A^T A), ‖A‖ = λ_max^{1/2}(A^T A), and ‖A‖_∞ = max_{i,j} |A_ij|. For two sequences a_n and b_n, write a_n ≪ b_n (equivalently, b_n ≫ a_n) if a_n = o(b_n). Moreover, |β|_0 denotes the number of nonzero components of a vector β. Finally, P_n′(t) and P_n″(t) denote the first and second derivatives of a penalty function P_n(t), if they exist.

2. Necessary Condition for Variable Selection Consistency

2.1. Penalized regression and necessary condition

Let s denote the dimension of the true vector of nonzero coefficients β_{0S}. The sparse structure assumes that s is small compared to the sample size. A penalized regression problem, in general, takes the form:

min_{β∈ℝ^p} L_n(β) + Σ_{j=1}^p P_n(|β_j|),

where P_n(·) denotes a penalty function. Relatively little attention has been paid to the necessary conditions for the penalized estimator to achieve the oracle property. Zhao and Yu (2006) derived an almost necessary condition for the sign consistency, which is similar to that of Zou (2006) for the least squares loss with the Lasso penalty. To the authors' best knowledge, so far there has been no necessary condition on the loss function for the selection consistency in the high-dimensional framework. Such a necessary condition is important, because it provides us with a way to judge whether a specific loss function can result in consistent variable selection.

Theorem 2.1 (Necessary Condition)

Suppose:

  1. Ln(β) is twice differentiable, and

    max_{1≤l,j≤p} |∂²L_n(β_0)/∂β_l ∂β_j| = O_p(1).
  2. There is a local minimizer β̂ = (β̂S, β̂N)T of

    L_n(β) + Σ_{j=1}^p P_n(|β_j|)

    such that P(β̂_N = 0) → 1 and √s ‖β̂ − β_0‖ = o_p(1).

  3. The penalty satisfies: P_n(·) ≥ 0, P_n(0) = 0, P_n′(t) is non-increasing on (0, u) for some u > 0, and lim_{n→∞} lim_{t→0+} P_n′(t) = 0. Then for any l ≤ p,

|∂L_n(β_0)/∂β_l| →_p 0. (2.1)

The implication (2.1) is fundamentally different from the “irrepresentable condition” in Zhao and Yu (2006) and that of Zou (2006). It imposes a restriction on the loss function L_n(·), whereas the “irrepresentable condition” is derived under the least squares loss and E(εX) = 0. For the least squares loss, (2.1) reduces to n^{-1} Σ_{i=1}^n ε_i X_il = o_p(1), or E(εX_l) = 0, which requires an exogenous relationship between ε and X. In contrast, the irrepresentable condition requires a type of relationship between important and unimportant regressors and is specific to the Lasso. It also differs from the Karush-Kuhn-Tucker (KKT) condition (e.g., Fan and Lv 2011) in that it is about the gradient vector evaluated at the true parameters rather than at the local minimizer.

The conditions on the penalty function in condition (iii) are very general, and are satisfied by a large class of popular penalties, such as Lasso (Tibshirani 1996), SCAD (Fan and Li 2001) and MCP (Zhang 2010), as long as their tuning parameter λ_n → 0. Hence this theorem should be understood as a necessary condition imposed on the loss function instead of the penalty.

2.2. Inconsistency of least squares with endogeneity

As an application of Theorem 2.1, consider a linear model:

Y = X^T β_0 + ε = X_S^T β_{0S} + ε, (2.2)

where we may not have E(εX) = 0.

The conventional penalized least squares (PLS) problem is defined as:

min_β n^{-1} Σ_{i=1}^n (Y_i − X_i^T β)² + Σ_{j=1}^p P_n(|β_j|).

In the simpler case when s, the number of nonzero components of β_0, is bounded, it can be shown that if some regressors are correlated with the regression error ε, PLS does not achieve the variable selection consistency. This is because (2.1) does not hold for the least squares loss function. Hence, without the possibly ad hoc exogeneity assumption, PLS no longer works, as more formally stated below.

Theorem 2.2 (Inconsistency of PLS)

Suppose the data are i.i.d., s = O(1), and X has at least one endogenous component, that is, there is l such that |E(X_l ε)| > c for some c > 0. Assume that E X_l^4 < ∞, E ε^4 < ∞, and P_n(t) satisfies the conditions in Theorem 2.1. If β̃ = (β̃_S^T, β̃_N^T)^T, corresponding to the coefficients of (X_S, X_N), is a local minimizer of

n^{-1} Σ_{i=1}^n (Y_i − X_i^T β)² + Σ_{j=1}^p P_n(|β_j|),

then either ‖β̃_S − β_{0S}‖ does not converge in probability to zero, or lim sup_{n→∞} P(β̃_N = 0) < 1.

The index l in the condition of the above theorem does not have to be the index of an important regressor. Hence the consistency of penalized least squares will fail even if the endogeneity is only present in the unimportant regressors.

We conduct a simple simulated experiment to illustrate the impact of endogeneity on the variable selection. Consider

Y = X^T β_0 + ε, ε ~ N(0, 1), β_{0S} = (5, 4, 7, 2, 1.5); β_{0j} = 0 for 6 ≤ j ≤ p.
X_j = Z_j for j ≤ 5; X_j = (Z_j + 5)(1 + ε) for 6 ≤ j ≤ p.
Z ~ N_p(0, Σ), independent of ε, with (Σ)_ij = 0.5^{|i−j|}.
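For concreteness, here is a minimal sketch of this data-generating process in Python; the (1 + ε) factor in the endogenous regressors is our reading of the design above, and the function name is illustrative rather than the authors' code.

```python
import numpy as np

def simulate_design_2_2(n=200, p=50, seed=0):
    """One dataset from the design above: important regressors exogenous,
    unimportant regressors endogenous through the (1 + eps) factor."""
    rng = np.random.default_rng(seed)
    Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    eps = rng.normal(size=n)
    X = Z.copy()
    X[:, 5:] = (Z[:, 5:] + 5) * (1 + eps[:, None])   # endogenous unimportant regressors
    beta0 = np.zeros(p)
    beta0[:5] = [5, 4, 7, 2, 1.5]
    Y = X @ beta0 + eps
    return Y, X, beta0
```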

In the design, the unimportant regressors are endogenous. The penalized least squares (PLS) with the SCAD penalty was used for variable selection. The λ's in the table represent the tuning parameter used in the SCAD penalty. The results are based on the estimated (β̂_S^T, β̂_N^T)^T, obtained from minimizing the PLS and FGMM loss functions respectively (we shall discuss the construction of the FGMM loss function and its numerical minimization in detail subsequently). Here β̂_S and β̂_N represent the estimators of the coefficients of the important and unimportant regressors respectively.

From Table 1, PLS selects many unimportant regressors (FP). In contrast, the penalized FGMM performs well both in selecting the important regressors and in eliminating the unimportant ones. The larger MSES of β̂_S by FGMM is due to the moment conditions used in the estimation; this can be improved further as in Section 6. Also, when endogeneity is present in the important regressors, the PLS estimator has a larger bias (see the additional simulation results in Section 8).

Table 1.

Performance of PLS and FGMM over 100 replications. p = 50, n = 200

PLS FGMM
λ = 0.05 λ = 0.1 λ = 0.5 λ = 1 λ = 0.05 λ = 0.1 λ = 0.2
MSES 0.145 (0.053) 0.133 (0.043) 0.629 (0.301) 1.417 (0.329) 0.261 (0.094) 0.184 (0.069) 0.194 (0.076)
MSEN 0.126 (0.035) 0.068 (0.016) 0.072 (0.016) 0.095 (0.019) 0.001 (0.010) 0 (0) 0.001 (0.009)
TP 5 (0) 5 (0) 4.82 (0.385) 3.63 (0.504) 5 (0) 5 (0) 5 (0)
FP 37.68 (2.902) 35.36 (3.045) 8.84 (3.334) 2.58 (1.557) 0.08 (0.337) 0 (0) 0.02 (0.141)

MSES is the average of ‖β̂_S − β_{0S}‖ for the nonzero coefficients. MSEN is the average of ‖β̂_N − β_{0N}‖ for the zero coefficients. TP is the number of correctly selected variables, and FP is the number of incorrectly selected variables. The standard error of each measure is also reported.

3. Focused GMM

3.1. Definition

Because of the presence of endogenous regressors, we introduce an instrumental variable (IV) regression model. Consider a more general nonlinear model:

E[g(Y, X_S^T β_{0S}) | W] = 0, (3.1)

where Y stands for the dependent variable and g : ℝ × ℝ → ℝ is a known function. For simplicity, we require g to be one-dimensional; it should be thought of as a possibly nonlinear residual function. Our result can be naturally extended to a multi-dimensional g. Here W is a vector of observed random variables, known as instrumental variables.

Model (3.1) is called a conditional moment restricted model, which has been extensively studied in the literature, e.g., Newey (1993), Donald et al. (2009), Kitamura et al (2004). The high-dimensional model is also closely related to the semi/nonparametric model estimated by sieves with a growing sieve dimension, e.g., Ai and Chen (2003). Recently van de Geer (2008) and Fan and Lv (2011) considered generalized linear models without endogeneity. Some interesting examples of the generalized linear model that fit into (3.1) are:

  • linear regression, g(t_1, t_2) = t_1 − t_2;

  • logit model, g(t1, t2) = t1 − exp(t2)/(1 + exp(t2));

  • probit model, g(t1, t2) = t1 − Φ(t2) where Φ(·) denotes the standard normal cumulative distribution function.

Let (f_1, …, f_p) and (h_1, …, h_p) be two different sets of transformations of W, which can be taken as a large number of series basis functions, e.g., B-splines, Fourier series, polynomials (see Chen 2007 for discussions of the choice of sieve functions). Here each f_j and h_j is a scalar function. Write F = (f_1(W), …, f_p(W))^T and H = (h_1(W), …, h_p(W))^T. The conditional moment restriction (3.1) then implies that

E[g(Y, X_S^T β_{0S}) F_S] = 0, and E[g(Y, X_S^T β_{0S}) H_S] = 0, (3.2)

where F_S and H_S are the subvectors of F and H whose supports are on the oracle set S = {j ≤ p : β_0j ≠ 0}. In particular, when all the components of X_S are known to be exogenous, we can take F = X and H = X² (the vector of squares of X taken coordinate-wise), or H = cos(X) + 1 if X is a binary variable. A typical estimator based on moment conditions like (3.2) can be obtained via the generalized method of moments (GMM, Hansen 1982). However, in the problem considered here, (3.2) cannot be used directly to construct the GMM criterion function, because the identities of X_S are unknown.

Remark 3.1

One seemingly workable solution is to define V as a vector of transformations of W, for instance V = F, and apply GMM to the moment condition E[g(Y, X^T β_0)V] = 0. However, one has to take dim(V) ≥ dim(β) = p to guarantee that the GMM criterion function has a unique minimizer (in the linear model, for instance). Since p ≫ n, the dimension of V is too large, and the sample analogue of the GMM criterion function may not converge to its population version due to the accumulation of high-dimensional estimation errors.

Let us introduce some additional notation. For any β ∈ ℝ^p \ {0} and i = 1, …, n, define the r = |β|_0-dimensional vectors

F_i(β) = (f_{l_1}(W_i), …, f_{l_r}(W_i))^T and H_i(β) = (h_{l_1}(W_i), …, h_{l_r}(W_i))^T, where (l_1, …, l_r) are the indices of the nonzero components of β. For example, if p = 3 and β = (−1, 0, 2)^T, then F_i(β) = (f_1(W_i), f_3(W_i))^T and H_i(β) = (h_1(W_i), h_3(W_i))^T, i ≤ n.

Our focused GMM (FGMM) loss function is defined as

L_FGMM(β) = Σ_{j=1}^p I(β_j ≠ 0) { w_{j1} [ n^{-1} Σ_{i=1}^n g(Y_i, X_i^T β) f_j(W_i) ]² + w_{j2} [ n^{-1} Σ_{i=1}^n g(Y_i, X_i^T β) h_j(W_i) ]² }, (3.3)

where w_{j1} and w_{j2} are given weights. For example, we will take w_{j1} = 1/var̂(f_j(W)) and w_{j2} = 1/var̂(h_j(W)) to standardize the scale (here var̂ represents the sample variance). Writing it in matrix form, for V_i(β) = (F_i(β)^T, H_i(β)^T)^T,

L_FGMM(β) = [ n^{-1} Σ_{i=1}^n g(Y_i, X_i^T β) V_i(β) ]^T J(β) [ n^{-1} Σ_{i=1}^n g(Y_i, X_i^T β) V_i(β) ],

where J(β) = diag{w_{l_1 1}, …, w_{l_r 1}, w_{l_1 2}, …, w_{l_r 2}}.⁴

Unlike the traditional GMM, the “working instrumental variables” V(β) depend irregularly on the unknown β. As will be further explained, this ensures dimension reduction and allows us to focus only on the equations whose instruments are supported on the oracle space; the method is therefore called the focused GMM, or FGMM for short.
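To make the construction concrete, here is a minimal sketch of the loss (3.3) in Python for a generic residual function g; the default g(y, t) = y − t corresponds to the linear model, and all names are illustrative rather than the authors' implementation.

```python
import numpy as np

def fgmm_loss(beta, Y, X, F, H, g=lambda y, t: y - t, tol=1e-10):
    """FGMM loss (3.3): only the moment conditions indexed by the support of
    beta are used, each standardized by the sample variance of its instrument."""
    resid = g(Y, X @ beta)                      # g(Y_i, X_i^T beta), length n
    support = np.flatnonzero(np.abs(beta) > tol)
    loss = 0.0
    for j in support:
        w_f = 1.0 / np.var(F[:, j])             # weight w_{j1}
        w_h = 1.0 / np.var(H[:, j])             # weight w_{j2}
        loss += w_f * np.mean(resid * F[:, j]) ** 2
        loss += w_h * np.mean(resid * H[:, j]) ** 2
    return loss
```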

We then define the FGMM estimator by minimizing the following criterion function:

Q_FGMM(β) = L_FGMM(β) + Σ_{j=1}^p P_n(|β_j|). (3.4)

Sufficient conditions on the penalty function P_n(|β_j|) for the oracle property will be presented in Section 4. Penalization is needed because otherwise small coefficients in front of unimportant variables would still be kept when minimizing L_FGMM(β). As will become clearer in Section 6, the FGMM focuses on model selection and estimation consistency without paying much attention to the efficient estimation of β_{0S}.

3.2. Rationales behind the construction of FGMM

3.2.1. Inclusion of V(β)

We construct the FGMM criterion function using

V(β) = (F(β)^T, H(β)^T)^T.

A natural question arises: why not just use one set of IVs so that V(β) = F(β)? We now explain the rationale behind the inclusion of the second set of instruments H(β). To simplify notation, let F_ij = f_j(W_i) and H_ij = h_j(W_i) for j ≤ p and i ≤ n. Then F_i = (F_i1, …, F_ip) and H_i = (H_i1, …, H_ip). Also write F_j = f_j(W) and H_j = h_j(W) for j ≤ p.

Let us consider a linear regression model (2.2) as an example. If H(β) were not included and V(β) = F(β) had been used, the GMM loss function would have been constructed as

L_V(β) = ‖ n^{-1} Σ_{i=1}^n (Y_i − X_i^T β) F_i(β) ‖², (3.5)

where, for the simplicity of illustration, J(β) is taken as an identity matrix. We also use the L_0-penalty P_n(|β_j|) = λ_n I(β_j ≠ 0) for illustration. Suppose that the true β_0 = (β_{0S}^T, 0, …, 0)^T, where only the first s components are nonzero, and that s > 1. If we, however, restrict ourselves to β^p = (0, …, 0, β_p), the criterion function now becomes

Q_FGMM(β^p) = [ n^{-1} Σ_{i=1}^n (Y_i − X_ip β_p) F_ip ]² + λ_n.

It is easy to see that its minimum is just λ_n. On the other hand, if we optimize Q_FGMM on the oracle space β = (β_S^T, 0)^T, then

min_{β=(β_S^T,0)^T, β_{S,j}≠0} Q_FGMM(β) ≥ sλ_n.

As a result, minimizing (3.5) plus the penalty is inconsistent for variable selection: the restricted single-regressor model attains a strictly smaller criterion value (λ_n versus at least sλ_n).

The use of the L_0-penalty is not essential in the above illustration. The problem is still present if the L_1-penalty is used, and is not merely due to the biasedness of the L_1-penalty. For instance, recall that for the SCAD penalty with hyper-parameter (a, λ_n), P_n(·) is non-decreasing, and P_n(t) = (a+1)λ_n²/2 when t ≥ aλ_n. Given that min_{j∈S} |β_0j| ≫ λ_n,

Q_FGMM(β_0) ≥ Σ_{j∈S} P_n(|β_0j|) ≥ s P_n(min_{j∈S} |β_0j|) = (a+1)λ_n² s/2.

On the other hand, Q_FGMM(β^p) ≈ P_n(|β_p|) ≤ (a+1)λ_n²/2, which is strictly less than Q_FGMM(β_0) when s > 1. So the problem is still present when an asymptotically unbiased penalty (e.g., SCAD, MCP) is used.

Including an additional term H(β) in V(β) can overcome this problem. For example, if we still restrict to βp = (0,…, βp) but include an additional but different IV Hip, the criterion function then becomes, for the L0 penalty:

Q_FGMM(β^p) = [ n^{-1} Σ_{i=1}^n (Y_i − X_ip β_p) F_ip ]² + [ n^{-1} Σ_{i=1}^n (Y_i − X_ip β_p) H_ip ]² + λ_n.

In general, the first two terms cannot be o_p(1) simultaneously as long as the two sets of transformations {f_j(·)} and {h_j(·)} are fixed differently, n is large and

(E X_p F_p)^{-1} E(Y F_p) ≠ (E X_p H_p)^{-1} E(Y H_p). (3.6)

As a result, QFGMM(βp) is bounded away from zero with probability approaching one.
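The following toy computation (our own illustrative design, not one from the paper) makes this point numerically: solving the single F-moment equation exactly for β_p leaves the H-moment bounded away from zero, so the sample analogue of (3.6) holds and the restricted criterion stays bounded away from zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 3
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
W = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
F, H = np.sin(W), np.cos(W)                  # two different instrument transformations
X = F + H + rng.normal(size=(n, p))          # exogenous regressors built from W
beta0 = np.array([2.0, -1.5, 0.0])           # the last regressor X_p is unimportant
Y = X @ beta0 + rng.normal(size=n)

# Restrict to beta^p = (0, ..., 0, beta_p) and solve the single F-moment exactly.
Fp, Hp, Xp = F[:, -1], H[:, -1], X[:, -1]
beta_p = np.mean(Y * Fp) / np.mean(Xp * Fp)
m_F = np.mean((Y - Xp * beta_p) * Fp)        # zero by construction
m_H = np.mean((Y - Xp * beta_p) * Hp)        # stays away from zero
print(round(m_F, 8), round(m_H, 3))
```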

To better understand the behavior of QFGMM (β), it is more convenient to look at the population analogues of the loss function. Because the number of equations in

E[(Y − X^T β) F(β)] = 0 and E[(Y − X^T β) H(β)] = 0 (3.7)

is twice the number of unknowns (nonzero components of β), if we denote by S̃ the support of β, then (3.7) has a solution only when (E F_{S̃} X_{S̃}^T)^{-1} E(Y F_{S̃}) = (E H_{S̃} X_{S̃}^T)^{-1} E(Y H_{S̃}), which does not hold in general unless S̃ = S, the index set of the true nonzero coefficients. Hence it is natural for (3.7) to have a unique solution β = β_0. As a result, if we define

G(β) = ‖E[(Y − X^T β) F(β)]‖² + ‖E[(Y − X^T β) H(β)]‖²,

the population version of LFGMM, then as long as β is not close to β0, G should be bounded away from zero. Therefore, it is reasonable for us to assume that for any δ > 0, there is γ(δ) > 0 such that

inf_{‖β−β_0‖>δ, β≠0} G(β) > γ(δ). (3.8)

On the other hand, E(ε | W) = E(Y − X_S^T β_{0S} | W) = 0 implies G(β_0) = 0.

Our FGMM loss function is essentially a sample version of G(β), so minimizing LFGMM(β) forces the estimator to be close to β0, but small coefficients in front of unimportant but exogenous regressors may still be allowed. Hence a concave penalty function is added to LFGMM to define QFGMM.

3.2.2. Indicator function

Another question readers may ask is why we do not define L_FGMM(β) to be, for some weight matrix J,

[ n^{-1} Σ_{i=1}^n g(Y_i, X_i^T β) V_i ]^T J [ n^{-1} Σ_{i=1}^n g(Y_i, X_i^T β) V_i ], (3.9)

that is, why not replace the irregular β-dependent V(β) with V and use the entire 2p-dimensional V = (F^T, H^T)^T as the IV? This is equivalent to asking why the indicator function in (3.3) cannot be dropped.

The indicator function is used to prevent the accumulation of estimation errors under the high dimensionality. To see this, rewrite (3.9) to be:

Σ_{j=1}^p [ 1/var̂(F_j) ] ( n^{-1} Σ_{i=1}^n g(Y_i, X_i^T β) F_ij )² + [ 1/var̂(H_j) ] ( n^{-1} Σ_{i=1}^n g(Y_i, X_i^T β) H_ij )².

Since dim(V_i) = 2p ≫ n, even though each individual term evaluated at β = β_0 is O_p(1/n), the sum of p such terms can become stochastically unbounded. In general, (3.9) does not converge to its population analogue when p ≫ n because the accumulation of high-dimensional estimation errors has a non-negligible effect.

In contrast, the indicator function effectively reduces the dimension and prevents the accumulation of estimation errors. Once the indicator function is included, the proposed FGMM loss function evaluated at β0 becomes:

Σ_{j∈S} [ 1/var̂(F_j) ] ( n^{-1} Σ_{i=1}^n g(Y_i, X_i^T β_0) F_ij )² + [ 1/var̂(H_j) ] ( n^{-1} Σ_{i=1}^n g(Y_i, X_i^T β_0) H_ij )²,

which is small because E[g(Y, X^T β_0) F_S] = E[g(Y, X^T β_0) H_S] = 0 and there are only s = |S| terms in the summation.

Recently, there has been growing literature on the shrinkage GMM, e.g., Caner (2009), Caner and Zhang (2013), Liao (2013), etc, regarding estimation and variable selection based on a set of moment conditions like (3.2). The model considered by these authors is restricted to either a low-dimensional parameter space or a low-dimensional vector of moment conditions, where there is no such a problem of error accumulations.

4. Oracle Property of FGMM

FGMM involves a non-smooth loss function. In the appendix, we develop a general asymptotic theory for high-dimensional models to accommodate the non-smooth loss function.

Our first assumption defines the penalty function we use. Consider a similar class of folded concave penalty functions as that in Fan and Li (2001).

For any β = (β_1, …, β_s)^T ∈ ℝ^s with |β_j| ≠ 0, j = 1, …, s, define

η(β) = lim sup_{ε→0+} max_{j≤s} sup_{t_1<t_2, (t_1,t_2)⊂(|β_j|−ε, |β_j|+ε)} −[P_n′(t_2) − P_n′(t_1)]/(t_2 − t_1), (4.1)

which equals max_{j≤s} −P_n″(|β_j|) if the second derivative of P_n is continuous. Let

d_n = (1/2) min{ |β_0j| : β_0j ≠ 0, j = 1, …, p }

represent the strength of signals.

Assumption 4.1

The penalty function P_n(t) : [0, ∞) → ℝ satisfies:

  1. Pn(0) = 0

  2. P_n(t) is concave and non-decreasing on [0, ∞), and has a continuous derivative P_n′(t) when t > 0.

  3. √s P_n′(d_n) = o(d_n).

  4. There exists c > 0 such that supβ∈B(β0S,cdn) η(β) = o(1).

These conditions are standard. The concavity of P_n(·) implies that η(β) ≥ 0 for all β ∈ ℝ^s. It is straightforward to check that with properly chosen tuning parameters, the L_q penalty (for q ≤ 1), hard-thresholding (Antoniadis 1996), SCAD (Fan and Li 2001) and MCP (Zhang 2010) all satisfy these conditions. As thoroughly discussed by Fan and Li (2001), a penalty function that is desirable for achieving the oracle properties should result in an estimator with three properties: unbiasedness, sparsity and continuity (see Fan and Li 2001 for details). These properties motivate the need for a folded concave penalty.
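For reference, a minimal sketch of the SCAD penalty and its derivative (the standard formulas of Fan and Li 2001) is given below; it can be used to check Assumption 4.1 numerically for a given (a, λ_n).

```python
import numpy as np

def scad(t, lam, a=3.7):
    """SCAD penalty P_lambda(t) for t >= 0 (Fan and Li 2001)."""
    t = np.abs(t)
    return np.where(t <= lam, lam * t,
           np.where(t <= a * lam,
                    (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
                    (a + 1) * lam**2 / 2))

def scad_deriv(t, lam, a=3.7):
    """P'_lambda(t): equals lam on (0, lam], decays linearly, vanishes for t > a*lam."""
    t = np.abs(t)
    return lam * ((t <= lam) + (t > lam) * np.maximum(a * lam - t, 0.0) / ((a - 1) * lam))
```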

The following assumptions are further imposed. Recall that for jp, Fj = fj (W) and Hj = hj (W).

Assumption 4.2

  1. The true parameter β0 is uniquely identified by E(g(Y, XT β0)|W) = 0.

  2. (Y1, X1),…, (Yn, Xn) are independent and identically distributed.

Remark 4.1

Condition (i) above is standard in the GMM literature (e.g., Newey 1993, Donald et al. 2009, Kitamura et al. 2004). This condition is closely related to the “over-identifying restriction”, and ensures that we can always find two sets of transformations F and H such that the equations in (3.2) are uniquely satisfied by β_S = β_{0S}. In linear models, this is a reasonable assumption, as discussed in Section 3.2. In nonlinear models, however, requiring identifiability from either E(g(Y, X^T β_0) | W) = 0 or (3.2) may be restrictive. Indeed, Dominguez and Lobato (2004) showed that the identification condition in (i) may depend on the marginal distribution of W. Furthermore, in nonparametric regression problems, as in Bickel et al. (2009) and Ai and Chen (2003), the sufficient condition for Condition (i) is even more complicated; it also depends on the conditional distribution of X | W, and is known to be statistically untestable (see Newey and Powell 2003, Canay et al. 2013).

Assumption 4.3

There exist b1, b2, b3 > 0 and r1, r2, r3 > 0 such that for any t > 0,

  1. P(|g(Y, X^T β_0)| > t) ≤ exp(−(t/b_1)^{r_1}),

  2. max_{l≤p} P(|F_l| > t) ≤ exp(−(t/b_2)^{r_2}), and max_{l≤p} P(|H_l| > t) ≤ exp(−(t/b_3)^{r_3}).

  3. min_{j∈S} var(g(Y, X^T β_0) F_j) and min_{j∈S} var(g(Y, X^T β_0) H_j) are bounded away from zero.

  4. var(F_j) and var(H_j) are bounded away from both zero and infinity uniformly in j = 1, …, p and p ≥ 1.

We will assume g(·,·) to be twice differentiable, and in the following assumptions, let

m(t_1, t_2) = ∂g(t_1, t_2)/∂t_2,  q(t_1, t_2) = ∂²g(t_1, t_2)/∂t_2²,  V_S = (F_S^T, H_S^T)^T.

Assumption 4.4

  1. g(·,·) is twice differentiable.

  2. supt1, t2 |m(t1, t2)| < ∞, and supt1, t2 |q(t1, t2)| <∞.

It is straightforward to verify Assumption 4.4 for linear, logistic and probit regression models.

Assumption 4.5

There exist C1 > 0 and C2 > 0 such that

λ_max[ (E m(Y, X_S^T β_{0S}) X_S V_S^T)(E m(Y, X_S^T β_{0S}) X_S V_S^T)^T ] < C_1, and λ_min[ (E m(Y, X_S^T β_{0S}) X_S V_S^T)(E m(Y, X_S^T β_{0S}) X_S V_S^T)^T ] > C_2.

These conditions require that the instrument VS be not weak, that is, VS should not be weakly correlated with the important regressors. In the generalized linear model, Assumption 4.5 is satisfied if proper conditions on the design matrices are imposed. For example, in the linear regression model and probit model, we assume the eigenvalues of (EXSVST)(EXSVST)T and (Eϕ(XTβ0)XSVST)(Eϕ(XTβ0)XSVST)T are bounded away from both zero and infinity respectively, where ϕ(·) is the standard normal density function. Conditions in the same spirit are also assumed in, e.g., Bradic et al. (2011), and Fan and Lv (2011).

Define

ϒ = var( g(Y, X_S^T β_{0S}) V_S ). (4.2)

Assumption 4.6

  1. For some c > 0, λmin(ϒ) > c.

  2. √s P_n′(d_n) + √(s (log p)/n) + √(s³ (log s)/n) = o(P_n′(0+)), s² P_n′(d_n) = O(1), and √(s (log p)/n) = o(d_n).

  3. P_n′(d_n) = o(1/√(ns)) and sup_{‖β−β_{0S}‖≤d_n/4} η(β) = o((s log p)^{−1/2}).

  4. max_{j∉S} ‖E[m(Y, X^T β_0) X_j V_S]‖ √((log s)/n) = o(P_n′(0+)).

This assumption imposes a further condition jointly on the penalty, the strength of the minimal signal and the number of important regressors. Condition (i) is needed for the asymptotic normality of the estimated nonzero coefficients. When either SCAD or MCP is used as the penalty function with a tuning parameter λ_n, P_n′(d_n) = sup_{‖β−β_{0S}‖≤d_n/4} η(β) = 0 and P_n′(0+) = λ_n when λ_n = o(d_n). Thus Conditions (ii)-(iv) in the assumption are satisfied as long as √(s (log p)/n) + √(s³ (log s)/n) ≪ λ_n ≪ d_n. This requires the signal d_n to be strong and s to be small compared to n. Such a condition is needed to achieve the variable selection consistency.

Under the foregoing regularity conditions, we can show the oracle property of a local minimizer of QFGMM (3.4).

Theorem 4.1

Suppose s³ log p = o(n). Under Assumptions 4.1-4.6, there exists a local minimizer β̂ = (β̂_S^T, β̂_N^T)^T of Q_FGMM(β), with β̂_S and β̂_N being the sub-vectors of β̂ whose coordinates are in S and S^c respectively, such that:

  1. √n α^T Γ^{−1/2} Σ (β̂_S − β_{0S}) →_d N(0, 1),

    for any unit vector α ∈ ℝ^s, ‖α‖ = 1, where A = E[m(Y, X^T β_0) X_S V_S^T],

    Γ = 4 A J(β_0) ϒ J(β_0) A^T, and Σ = 2 A J(β_0) A^T.
  2. lim_{n→∞} P(β̂_N = 0) = 1.

    In addition, the local minimizer β̂ is strict with probability at least 1 – δ for an arbitrarily small δ > 0 and all large n.

  3. Let Ŝ = {j ≤ p : β̂_j ≠ 0}. Then

    P(Ŝ = S) → 1.

Remark 4.2

As was shown in an earlier version of this paper (Fan and Liao 2012), when it is known that E[g(Y, X^T β_0) | X_S] = 0 but possibly E[g(Y, X^T β_0) | X] ≠ 0, we can take V = (F^T, H^T)^T to be transformations of X that satisfy Assumptions 4.3-4.6. In this way, we do not need an extra instrumental variable W, and Theorem 4.1 still goes through, while the traditional methods (e.g., penalized least squares in the linear model) can still fail, as shown by Theorem 2.2. In the high-dimensional linear model, compared with the classical assumption E(ε | X) = 0, our condition E(ε | X_S) = 0 is relatively easier to validate since X_S is a low-dimensional vector.

Remark 4.3

We now explain the required lower bound on the signal, √(s (log p)/n) = o(d_n). When a penalized regression of the form min_β L_n(β) + Σ_{j=1}^p P_n(|β_j|) is used, it is required that, if L_n(β) is differentiable, max_{j∉S} |∂L_n(β_0)/∂β_j| = o(P_n′(0+)). This often leads to a requirement on the lower bound of d_n. Therefore, such a lower bound on d_n depends on the choice of both the loss function L_n(β) and the penalty. For instance, in the linear model, when least squares with a SCAD penalty is employed, this condition is equivalent to √((log p)/n) = o(d_n). It is also known that the adaptive lasso penalty requires the minimal signal to be significantly larger than √((log p)/n) (Huang, Ma and Zhang 2008). In our framework, the requirement √(s (log p)/n) = o(d_n) arises from the use of the new FGMM loss function. Such a condition is stronger than that of the least squares loss function, which is the price paid to achieve variable selection consistency in the presence of endogeneity. This condition is still easy to satisfy as long as s grows slowly with n.

Remark 4.4

Similar to the “irrepresentable condition” for the Lasso, the FGMM requires that the important and unimportant explanatory variables not be strongly correlated. This is fulfilled by Assumption 4.6(iv). For instance, in the linear model when V_S contains X_S, as in our earlier version, this condition implies max_{j∉S} ‖E X_j X_S‖ √((log s)/n) = o(λ_n). Strong correlation between (X_S, X_N) is also ruled out by the identifiability condition, Assumption 4.2. To illustrate the idea, consider a case of perfect linear correlation: X_S^T α − X_N^T δ = 0 for some (α, δ) with δ ≠ 0. Then X^T β_0 = X_S^T(β_{0S} − α) + X_N^T δ. As a result, the FGMM can be variable selection inconsistent because β_0 and (β_{0S} − α, δ) are observationally equivalent, violating Assumption 4.2.

5. Global minimization

With the over-identification condition, we can show that the local minimizer in Theorem 4.1 is nearly global. To this end, define an l_∞ ball centered at β_0 with radius δ:

Θ_δ = { β ∈ ℝ^p : |β_i − β_0i| < δ, i = 1, …, p }.

Assumption 5.1 (over-identification)

For any δ > 0, there is γ > 0 such that

lim_{n→∞} P( inf_{β∉Θ_δ∪{0}} ‖ n^{-1} Σ_{i=1}^n g(Y_i, X_i^T β) V_i(β) ‖² > γ ) = 1.

This high-level assumption is hard to avoid in high-dimensional problems. It is the empirical counterpart of (3.8). In classical low-dimensional regression models, this assumption has often been imposed in the econometric literature, e.g., Andrews (1999), Chernozhukov and Hong (2003), among many others. Let us illustrate it by the following example.

Example 5.1

Consider a linear regression model of low dimensions: E(Y − X_S^T β_{0S} | W) = 0, which implies E[(Y − X_S^T β_{0S}) F_S] = 0 and E[(Y − X_S^T β_{0S}) H_S] = 0, where p is either bounded or slowly diverging with n. Now consider the following problem:

min_{β≠0} G(β) ≡ min_{β≠0} ‖E[(Y − X^T β) F(β)]‖² + ‖E[(Y − X^T β) H(β)]‖².

Once [E F_{S̃} X_{S̃}^T]^{-1} E[F_{S̃} Y] ≠ [E H_{S̃} X_{S̃}^T]^{-1} E[H_{S̃} Y] for all index sets S̃ ≠ S, the objective function is minimized to zero uniquely by β = β_0. Moreover, for any δ > 0 there is γ > 0 such that when β ∉ Θ_δ ∪ {0}, we have G(β) > γ > 0. Assumption 5.1 then follows from the uniform weak law of large numbers: with probability approaching one, uniformly in β ∉ Θ_δ ∪ {0},

‖ n^{-1} Σ_{i=1}^n F_i(β)(Y_i − X_i^T β) ‖² + ‖ n^{-1} Σ_{i=1}^n H_i(β)(Y_i − X_i^T β) ‖² > γ/2.

When p is much larger than n, the accumulation of the fluctuations from using the law of large numbers is no longer negligible. It is then challenging to show that ‖E[g(Y, X^T β) V(β)]‖ is close to ‖ n^{-1} Σ_{i=1}^n g(Y_i, X_i^T β) V_i(β) ‖ uniformly over high-dimensional β's, which is why we impose Assumption 5.1 on the empirical counterpart instead of the population version.

Theorem 5.1

Assume max_{j∈S} P_n(|β_0j|) = o(s^{-1}). Under Assumption 5.1 and those of Theorem 4.1, the local minimizer β̂ in Theorem 4.1 satisfies: for any δ > 0, there exists γ > 0,

lim_{n→∞} P( Q_FGMM(β̂) + γ < inf_{β∉Θ_δ∪{0}} Q_FGMM(β) ) = 1.

The above theorem demonstrates that β̂ is a nearly global minimizer. For the SCAD and MCP penalties, the condition max_{j∈S} P_n(|β_0j|) = o(s^{-1}) holds when λ_n = o(s^{-1}), which is satisfied if s is not large.

Remark 5.1

We exclude the set {0} from the search area in both Assumption 5.1 and Theorem 5.1 because we do not include an intercept in the model, so that no moment condition is selected at β = 0 and hence Q_FGMM(0) = 0 by definition. It is reasonable to believe that zero is not close to the true parameter, since we assume there is at least one important regressor in the model. On the other hand, if we always keep X_1 = 1 to allow for an intercept, there is no need to remove {0} in either Assumption 5.1 or the above theorem. Such a small change is not essential.

Remark 5.2

Assumption 5.1 can be slightly relaxed so that γ is allowed to decay slowly at a certain rate. The lower bound of such a rate is given by Lemma D.2 in the appendix. Moreover, Theorem 5.1 is based on an over-identification assumption, which is essentially different from the global minimization theory in the recent high-dimensional literature, e.g., Zhang (2010), Bühlmann and van de Geer (2011, ch 9), and Zhang and Zhang (2012).

6. Semi-parametric efficiency

The results in Section 4 demonstrate that the choice of the basis functions {f_j, h_j}_{j≤p} forming F and H influences the asymptotic variance of the estimator. The resulting estimator is in general not efficient. To obtain a semi-parametric efficient estimator, one can employ a second-step post-FGMM procedure. In the linear regression, a similar idea has been used by Belloni and Chernozhukov (2013).

After achieving the oracle properties in Theorem 4.1, we have identified the important regressors with probability approaching one, that is,

Ŝ = { j : β̂_j ≠ 0 },  X_Ŝ = ( X_j : j ∈ Ŝ ),  P(Ŝ = S) → 1.

This reduces the problem to a low-dimensional problem. For simplicity, we restrict s = O(1). The problem of constructing semi-parametric efficient estimator (in the sense of Newey (1990) and Bickel et al. (1998)) in a low-dimensional model

E[g(Y, X_S^T β_{0S}) | W] = 0

has been well studied in the literature (see, for example, Chamberlain (1987), Newey (1993)). The optimal instrument that leads to the semi-parametric efficient estimation of β0S is given by D(W)σ(W)−2, where

D(W) = E( ∂g(Y, X_S^T β_{0S})/∂β_S | W ),  σ(W)² = E( g(Y, X_S^T β_{0S})² | W ).

Newey (1993) showed that the semi-parametric efficient estimator of β0S can be obtained by GMM with the moment condition:

E[ g(Y, X_S^T β_{0S}) σ(W)^{-2} D(W) ] = 0. (6.1)

In the post-FGMM procedure, we replace X_S with the selected X_Ŝ obtained from the first-step penalized FGMM. Suppose there exist consistent estimators D̂(W) and σ̂(W)² of D(W) and σ(W)². Let us assume the true parameter satisfies ‖β_{0S}‖ < M for a large constant M > 0. We then estimate β_{0S} by solving

ρ_n(β_S) = n^{-1} Σ_{i=1}^n g(Y_i, X̂_{iS}^T β_S) σ̂(W_i)^{-2} D̂(W_i) = 0, (6.2)

on { β_S : ‖β_S‖ ≤ M }, and the solution is assumed to be unique.

Assumption 6.1

  1. There exist C1 > 0 and C2 > 0 so that

    C_1 < inf_{w∈χ} σ(w)² ≤ sup_{w∈χ} σ(w)² < C_2.

    In addition, there exist σ̂(w)² and D̂(w) such that

    sup_{w∈χ} |σ̂(w)² − σ(w)²| = o_p(1), and sup_{w∈χ} ‖D̂(w) − D(w)‖ = o_p(1),

    where χ is the support of W.

  2. E( sup_{‖β_S‖≤M} |g(Y, X_S^T β_S)|⁴ ) < ∞.

The consistent estimators for D(w) and σ(w)2 can be obtained in many ways. We present a few examples below.

Example 6.1 (Homoskedasticity)

Suppose Y = h(X_S^T β_{0S}) + ε for some nonlinear function h(·). Then σ(w)² = E(ε² | W = w) = σ², which does not depend on w under homoskedasticity. In this case, equations (6.1) and (6.2) do not depend on σ².

Example 6.2 (Simultaneous linear equations)

In the simultaneous linear equation model, XS linearly depends on W as:

g(Y, X_S^T β_S) = Y − X_S^T β_S,  X_S = Π W + u

for some coefficient matrix Π, where u is independent of W. Then D(w) = E(X_S | W = w) = Πw. Let X̄ = (X̂_{S1}, …, X̂_{Sn}) and W̄ = (W_1, …, W_n). We then estimate D(w) by Π̂w, where Π̂ = (X̄ W̄^T)(W̄ W̄^T)^{-1}.
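As a concrete sketch (illustrative code, not the authors' implementation), the post-FGMM step for this example reduces to an IV-type estimator: regress the selected regressors on W to obtain Π̂, set D̂(W_i) = Π̂W_i, and solve (6.2); under homoskedasticity σ̂(W_i)² is constant and cancels.

```python
import numpy as np

def post_fgmm_linear(Y, X_S, W):
    """Post-FGMM estimator for Example 6.2: X_S = Pi W + u, homoskedastic errors.
    Y: (n,), X_S: (n, s) regressors kept in the first step, W: (n, d) instruments."""
    Pi_hat = np.linalg.solve(W.T @ W, W.T @ X_S).T     # (s, d): least-squares fit of X_S on W
    D_hat = W @ Pi_hat.T                               # rows are D_hat(W_i)^T
    # Solve rho_n(beta_S) = 0:  sum_i D_hat(W_i) (Y_i - X_iS^T beta_S) = 0.
    return np.linalg.solve(D_hat.T @ X_S, D_hat.T @ Y)
```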

Example 6.3 (Semi-nonparametric estimation)

We can also assume a semi-parametric structure on the functional forms of D(w) and σ(w)2:

D(w)=D(w;θ1),σ(w)2=σ2(w;θ2),

where D(·; θ_1) and σ²(·; θ_2) are semi-parametric functions parameterized by θ_1 and θ_2. Then D(w) and σ(w)² are estimated using a standard semi-parametric method. More generally, we can proceed by a purely nonparametric approach, regressing ∂g(Y, X_Ŝ^T β̂_S)/∂β_S and g(Y, X_Ŝ^T β̂_S)² on W respectively, provided that the dimension of W is either bounded or grows slowly with n (see Fan and Yao, 1998).

Theorem 6.1

Suppose s = O(1), Assumption 6.1 and those of Theorem 4.1 hold. Then

√n (β̂_S − β_{0S}) →_d N( 0, [E(σ(W)^{-2} D(W) D(W)^T)]^{-1} ),

and [E(σ(W)−2D(W)D(W)T)]–1 is the semi-parametric efficiency bound in Chamberlain (1987).

7. Implementation

We now discuss the implementation for numerically minimizing the penalized FGMM criterion function.

7.1. Smoothed FGMM

As we previously discussed, including an indicator function benefits us in dimension reduction. However, it also makes L_FGMM non-smooth. Hence, minimizing Q_FGMM(β) = L_FGMM(β) + penalty is generally NP-hard.

We overcome this discontinuity problem by applying the smoothing technique as in Horowitz (1992) and Bondell and Reich (2012), which approximates the indicator function by a smooth kernel K : (−∞,∞) → ℝ that satisfies

  1. 0 ≤ K(t) < M for some finite M and all t ≥ 0.

  2. K(0) = 0 and lim|t|→∞ K(t) = 1.

  3. lim sup|t|→∞ |K′(t)t| = 0, and lim sup|t|→∞ |K″(t)t2| < ∞.

We can set K(t) = (F(t) − F(0))/(1 − F(0)), where F(t) is a twice differentiable cumulative distribution function. For a pre-determined small number h_n, L_FGMM is approximated by a continuous function L_K(β) with the indicator replaced by K(β_j²/h_n). The objective function of the smoothed FGMM is given by

Q_K(β) = L_K(β) + Σ_{j=1}^p P_n(|β_j|).

As h_n → 0+, K(β_j²/h_n) converges to I(β_j ≠ 0), and hence L_K(β) is simply a smoothed version of L_FGMM(β). As an illustration, Figure 1 plots such a function.

Fig 1. K(t²/h_n) = (exp(t²/h_n) − 1)/(exp(t²/h_n) + 1) as an approximation to I(t ≠ 0).
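A minimal sketch of this smoothed indicator with the logistic CDF (the same choice used in Section 8) is given below; the function name is illustrative.

```python
import numpy as np

def smoothed_indicator(beta, h=0.1):
    """K(beta_j^2 / h) with K(u) = (F(u) - F(0)) / (1 - F(0)) and F the logistic CDF,
    i.e. K(u) = (exp(u) - 1)/(exp(u) + 1); as h -> 0+ this approaches I(beta_j != 0)."""
    u = np.asarray(beta, dtype=float) ** 2 / h
    return -np.expm1(-u) / (1.0 + np.exp(-u))   # stable evaluation of (e^u - 1)/(e^u + 1)

# smoothed_indicator([0.0, 0.05, 1.0])  ->  approximately [0.0, 0.012, 1.0]
```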

Smoothing the indicator function is often seen in the literature on high-dimensional variable selection. Recently, Bondell and Reich (2012) approximated I(t ≠ 0) by (h_n + 1)|t|/(h_n + |t|) to obtain a tractable non-convex optimization problem. Intuitively, we expect that the smoothed FGMM should also achieve the variable selection consistency. Indeed, the following theorem formally proves this claim.

Theorem 7.1

Suppose h_n^{1−γ} = o(d_n²) for a small constant γ ∈ (0, 1). Under the assumptions of Theorem 4.1, there exists a local minimizer β̂′ of the smoothed FGMM criterion Q_K(β) such that, for Ŝ′ = { j ≤ p : β̂′_j ≠ 0 },

P(Ŝ′ = S) → 1.

In addition, the local minimizer β̂′ is strict with probability at least 1 − δ for an arbitrarily small δ > 0 and all large n.

The asymptotic normality of the estimated nonzero coefficients can be established very similarly to that of Theorem 4.1, which is omitted for brevity.

7.2. Coordinate descent algorithm

We employ the iterative coordinate algorithm for the smoothed FGMM minimization, which was used by Fu (1998), Daubechies et al. (2004), Fan and Lv (2011), etc. The iterative coordinate algorithm minimizes one coordinate of β at a time, with other coordinates kept fixed at their values obtained from previous steps, and successively updates each coordinate. The penalty function can be approximated by local linear approximation as in Zou and Li (2008).

Specifically, we run the regular penalized least squares to obtain an initial value, from which we start the iterative coordinate algorithm for the smoothed FGMM. Suppose β^{(l)} is obtained at step l. For k ∈ {1, …, p}, denote by β^{(l)}_{(−k)} the (p − 1)-dimensional vector consisting of all the components of β^{(l)} except β_k^{(l)}. Write (β^{(l)}_{(−k)}, t) for the p-dimensional vector that replaces β_k^{(l)} with t. The minimization with respect to t while keeping β^{(l)}_{(−k)} fixed is then a univariate minimization problem, which is not difficult to implement. To speed up the convergence, we can also use the second order approximation of L_K(β^{(l)}_{(−k)}, t) along the kth component at β_k^{(l)}:

L_K(β^{(l)}_{(−k)}, t) ≈ L_K(β^{(l)}) + [∂L_K(β^{(l)})/∂β_k](t − β_k^{(l)}) + (1/2)[∂²L_K(β^{(l)})/∂β_k²](t − β_k^{(l)})² ≡ L_K(β^{(l)}) + L̂_K(β^{(l)}_{(−k)}, t), (7.1)

where L̂_K(β^{(l)}_{(−k)}, t) is a quadratic function of t. We solve for

t* = argmin_t L̂_K(β^{(l)}_{(−k)}, t) + P_n′(|β_k^{(l)}|) |t|, (7.2)

which admits an explicit analytical solution, and keep the remaining components at their step-l values. Accept t* as the updated kth component of β^{(l)} only if L_K(β^{(l)}) + Σ_{j=1}^p P_n(|β_j^{(l)}|) strictly decreases. The coordinate descent algorithm runs as follows.

  1. Set l = 1. Initialize β(1) = β̂*, where β̂* solves

    min_{β∈ℝ^p} n^{-1} Σ_{i=1}^n [g(Y_i, X_i^T β)]² + Σ_{j=1}^p P_n(|β_j|)

    using the coordinate descent algorithm as in Fan and Lv (2011).

  2. Successively for k = 1, …, p, let t* be the minimizer of

    min_t L̂_K(β^{(l)}_{(−k)}, t) + P_n′(|β_k^{(l)}|) |t|.

    Update β_k^{(l)} as t* if

    L_K(β^{(l)}_{(−k)}, t*) + P_n(|t*|) < L_K(β^{(l)}) + P_n(|β_k^{(l)}|).

    Otherwise set β_k^{(l)} = β_k^{(l−1)}. Increase l by one when k = p.

  3. Repeat Step 2 until | QK(β(l))−QK (β(l+1))| < ε, for a pre-determined small ε.

When the second order approximation (7.1) is combined with SCAD in Step 2, the local linear approximation of SCAD is not needed. As demonstrated in Fan and Li (2001), when P_n(t) is defined using SCAD, the penalized optimization of the form min_{β∈ℝ} (1/2)(z − β)² + Λ P_n(|β|) has an analytical solution.
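For completeness, here is the standard SCAD thresholding rule of Fan and Li (2001), written as a sketch for the case Λ = 1 (the general Λ case is handled analogously):

```python
import numpy as np

def scad_univariate(z, lam, a=3.7):
    """Minimizer of 0.5*(z - b)^2 + P_lambda(|b|) for the SCAD penalty (Lambda = 1)."""
    az = np.abs(z)
    if az <= 2 * lam:                                   # soft-thresholding region
        return np.sign(z) * max(az - lam, 0.0)
    if az <= a * lam:                                   # intermediate region
        return ((a - 1) * z - np.sign(z) * a * lam) / (a - 2)
    return z                                            # no shrinkage for large |z|
```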

We can show that the evaluated objective values {QK(β(l))}l≥1 is a bounded Cauchy sequence. Hence for an arbitrarily small ε > 0, the above algorithm stops after finitely many steps. Let M(β) denote the map defined by the algorithm from β(l) to β(l+1). We define a stationary point of the function QK(β) to be any point β at which the gradient vector of QK(β) is zero. Similar to the local linear approximation of Zou and Li (2008), we have the following result regarding the property of the algorithm.

Theorem 7.2

The sequence {Q_K(β^{(l)})}_{l≥1} is a bounded non-increasing Cauchy sequence. Hence for any arbitrarily small ε > 0, the coordinate descent algorithm stops after finitely many iterations. In addition, if Q_K(β) = Q_K(M(β)) only for stationary points of Q_K(·) and if β* is a limit point of the sequence {β^{(l)}}_{l≥1}, then β* is a stationary point of Q_K(β).

Theoretical analysis of non-convex regularization in the recent decade has focused on numerical procedures that can find local solutions (Hunter and Li 2005, Kim et al. 2008, Breheny and Huang 2011). Proving that the algorithm achieves a solution that possesses the desired oracle properties is technically difficult. Our simulation results demonstrate that the proposed algorithm indeed reaches the desired sparse estimator. Further investigation along the lines of Zhang and Zhang (2012) and Loh and Wainwright (2013) is needed to establish the statistical properties of the solutions to such non-convex optimization problems, which we leave as future research.

8. Monte Carlo Experiments

8.1. Endogeneity in both important and unimportant regressors

To test the performance of FGMM for variable selection, we simulate from a linear model:

Y = X^T β_0 + ε,  (β_01, …, β_05) = (5, 4, 7, 2, 1.5);  β_0j = 0 for 6 ≤ j ≤ p,

with p = 50 or 200. Regressors are classified as exogenous (independent of ε) or endogenous. For each component of X, we write X_j = X_j^e if X_j is endogenous and X_j = X_j^x if X_j is exogenous, where X_j^e and X_j^x are generated according to:

X_j^e = (F_j + H_j + 1)(3ε + 1),  X_j^x = F_j + H_j + u_j,

where {ε, u_1, …, u_p} are independent N(0, 1). Here F = (F_1, …, F_p)^T and H = (H_1, …, H_p)^T are the transformations (to be specified later) of a three-dimensional instrumental variable W = (W_1, W_2, W_3)^T ~ N_3(0, I_3). There are m endogenous variables (X_1, X_2, X_3, X_6, …, X_{2+m})^T, with m = 10 or 50. Hence three of the important regressors (X_1, X_2, X_3) are endogenous while two are exogenous (X_4, X_5).

We apply the Fourier basis as the working instruments:

F = √2 { sin(jπW_1) + sin(jπW_2) + sin(jπW_3) : j ≤ p },  H = √2 { cos(jπW_1) + cos(jπW_2) + cos(jπW_3) : j ≤ p }.

The data contain n = 100 i.i.d. copies of (Y, X, F, H). PLS and FGMM are carried out separately for comparison. In our simulation we use SCAD with pre-determined tuning parameters of λ as the penalty function. The logistic cumulative distribution function with h = 0.1 is used for smoothing:

F(t) = exp(t)/(1 + exp(t)),  K(β_j²/h) = 2F(β_j²/h) − 1.
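A sketch of these working instruments in Python is given below; the √2 scaling follows the display above, and the function name is illustrative.

```python
import numpy as np

def fourier_instruments(W, p):
    """F_j = sqrt(2) * sum_k sin(j*pi*W_k), H_j = sqrt(2) * sum_k cos(j*pi*W_k), j = 1..p.
    W: (n, 3) draws of the instrumental variable; returns F, H of shape (n, p)."""
    j = np.arange(1, p + 1)
    angles = np.pi * W[:, :, None] * j          # shape (n, 3, p)
    F = np.sqrt(2) * np.sin(angles).sum(axis=1)
    H = np.sqrt(2) * np.cos(angles).sum(axis=1)
    return F, H
```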

There are 100 replications per experiment. Four performance measures are used to compare the methods. The first measure is the mean standard error (MSES) of the important regressors, given by the average of ‖β̂_S − β_{0S}‖ over the 100 replications, where S = {1, …, 5}. The second measure is the average of the MSE of the unimportant regressors, denoted by MSEN. The third measure is the number of correctly selected nonzero coefficients, that is, the true positive (TP), and the fourth measure is the number of incorrectly selected coefficients, the false positive (FP). In addition, the standard error over the 100 replications of each measure is also reported. In each simulation, we initialize β^{(0)} = (0, …, 0)^T and run a penalized least squares (SCAD(λ)) with λ = 0.5 to obtain the initial value for the FGMM procedure. The results of the simulation are summarized in Table 2, which compares the performance measures of PLS and FGMM.
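Combining the pieces, one replication of the Section 8.1 design can be generated as follows (a sketch that reuses fourier_instruments from above; the index bookkeeping for the m endogenous regressors is zero-based):

```python
import numpy as np

def simulate_design_8_1(n=100, p=50, m=10, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n, 3))                          # W ~ N_3(0, I_3)
    F, H = fourier_instruments(W, p)
    eps = rng.normal(size=n)
    X = F + H + rng.normal(size=(n, p))                  # exogenous version X_j^x
    endog = [0, 1, 2] + list(range(5, m + 2))            # X_1, X_2, X_3, X_6, ..., X_{m+2}
    X[:, endog] = (F[:, endog] + H[:, endog] + 1) * (3 * eps[:, None] + 1)
    beta0 = np.zeros(p)
    beta0[:5] = [5, 4, 7, 2, 1.5]
    Y = X @ beta0 + eps
    return Y, X, F, H, beta0
```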

Table 2. Endogeneity in both important and unimportant regressors, n = 100.

PLS FGMM
λ = 1 λ = 3 λ = 4 λ = 0.08 λ = 0.1 λ = 0.3 post-FGMM
p = 50 m = 10
MSES 0.190 (0.102) 0.525 (0.283) 0.491 (0.328) 0.106 (0.051) 0.097 (0.043) 0.102 (0.037) 0.088 (0.026)
MSEN 0.171 (0.059) 0.240 (0.149) 0.183 (0.149) 0.090 (0.030) 0.085 (0.035) 0.048 (0.034)
TP 5 (0) 5 (0) 4.97 (0.171) 5 (0) 5 (0) 5 (0)
FP 27.69 (6.260) 14.63 (5.251) 10.37 (4.539) 3.76 (1.093) 3.5 (1.193) 1.63 (1.070)

p = 200 m = 50
MSES 0.831 (0.787) 0.966 (0.595) 1.107 (0.678) 0.111 (0.048) 0.104 (0.041) 0.231 (0.431) 0.092 (0.032)
MSEN 1.286 (1.333) 0.936 (0.799) 0.828 (0.656) 0.062 (0.018) 0.063 (0.021) 0.053 (0.075)
TP 5 (0) 4.9 (0.333) 4.73 (0.468) 5 (0) 5 (0) 4.94 (0.246)
FP 86.760 (27.41) 42.440 (15.08) 35.070 (13.84) 4.726 (1.358) 4.276 (1.251) 2.897 (2.093)

m is the number of endogenous regressors. MSES is the average of ‖β̂_S − β_{0S}‖ for the nonzero coefficients. MSEN is the average of ‖β̂_N − β_{0N}‖ for the zero coefficients. TP is the number of correctly selected variables, and FP is the number of incorrectly selected variables. The standard error of each measure is also reported.

PLS has non-negligible false positives (FP). The average FP decreases as the magnitude of the penalty parameter increases, but at the cost of a relatively large MSES for the estimated nonzero coefficients, and the FP rate remains large compared to that of FGMM. PLS also misses some important regressors for larger λ. It is worth noting that the larger MSES for PLS is due to the bias of the least squares estimation in the presence of endogeneity. In contrast, FGMM performs well both in selecting the important regressors and in correctly eliminating the unimportant ones. The average MSES of FGMM is significantly less than that of PLS since instrumental variable estimation is applied instead. In addition, after the regressors are selected by the FGMM, the post-FGMM step further reduces the mean squared error of the estimators.

8.2. Endogeneity only in unimportant regressors

Consider a similar linear model, but now only the unimportant regressors are endogenous and all the important regressors are exogenous, as designed in Section 2.2, so the true model is the usual one without endogeneity. In this case, we apply (F, H) = (X, X²) as the working instruments for FGMM with the SCAD(λ) penalty, and need only the data X and Y = (Y_1, …, Y_n). We still compare the FGMM procedure with PLS. The results are reported in Table 3.

Table 3. Endogeneity only in unimportant regressors, n = 200.

PLS FGMM
λ = 0.1 λ = 0.5 λ = 1 λ = 0.05 λ = 0.1 λ = 0.2
p = 50
MSES 0.133 (0.043) 0.629 (0.301) 1.417 (0.329) 0.261 (0.094) 0.184 (0.069) 0.194 (0.076)
MSEN 0.068 (0.016) 0.072 (0.016) 0.095 (0.019) 0.001 (0.010) 0 (0) 0.001 (0.009)
TP 5 (0) 4.82 (0.385) 3.63 (0.504) 5 (0) 5 (0) 5 (0)
FP 35.36 (3.045) 8.84 (3.334) 2.58 (1.557) 0.08 (0.337) 0 (0) 0.02 (0.141)

p = 300
MSES 0.159 (0.054) 0.650 (0.304) 1.430 (0.310) 0.274 (0.086) 0.187 (0.102) 0.193 (0.123)
MSEN 0.107 (0.019) 0.071 (0.023) 0.086 (0.027) 5 × 10^{-4} (0.006) 0 (0) 5 × 10^{-4} (0.005)
TP 5 (0) 4.82 (0.384) 3.62 (0.487) 5 (0) 5 (0) 4.99 (0.100)
FP 210.47 (11.38) 42.78 (11.773) 7.94 (5.635) 0.11 (0.37) 0 (0) 0.01 (0.10)

It is clearly seen that even though only the unimportant regressors are endogenous, PLS still fails to select the true model correctly. This illustrates the variable selection inconsistency of PLS even when the true model itself involves no endogeneity. In contrast, the penalized FGMM still performs relatively well.

8.3. Weak minimal signals

To study the effect on variable selection when the strength of the minimal signal is weak, we run another set of simulations with the same data-generating process as in Design 1, but change β_4 = −0.5 and β_5 = 0.1, keeping all the remaining parameters the same as before. The minimal nonzero signal becomes |β_5| = 0.1. Three of the important regressors are endogenous as in Design 1. Table 4 indicates that the minimal signal is so small that it is not easily distinguishable from the zero coefficients.

Table 4. FGMM for weak minimal signal β4 = −0.5, β5 = 0.1.

p = 50 m = 10 p = 200 m = 50
λ = 0.05 λ = 0.1 λ = 0.5 λ = 0.05 λ = 0.1 λ = 0.5
MSES 0.128 (0.020) 0.107 (0.000) 0.118 (0.056) 0.138 (0.061) 0.125 (0.074) 0.238 (0.154)
MSEN 0.155 (0.054) 0.097 (0.000) 0.021 (0.033) 0.134 (0.052) 0.108 (0.043) 0.084 (0.062)
TP 4.12 (0.327) 4 (0) 4 (0) 4.04 (0.281) 3.98 (0.141) 3.8 (0.402)
FP 4.93 (1.578) 5 (0) 2.08 (0.367) 4.72 (1.198) 4.3 (0.948) 1.95 (1.351)

9. Conclusion and Discussion

Endogeneity can arise easily in the high-dimensional regression due to a large pool of regressors, which causes the inconsistency of the penalized least-squares methods and possible false scientific discoveries. Based on the over-identification assumption and valid instrumental variables, we propose to penalize an FGMM loss function. It is shown that FGMM possesses the oracle property, and the estimator is also a nearly global minimizer.

We would like to point out that this paper focuses on correctly specified sparse models, and the achieved results are “pointwise” for the true model. An important issue is uniform inference when the sparse model may be locally misspecified. While the oracle property is of fundamental importance for high-dimensional methods in many scientific applications, it may not enable us to make valid inference about the coefficients uniformly across a large class of models (Leeb and Pötscher 2008, Belloni et al. 2013).⁵ Therefore, the “post-double-selection” method with imperfect model selection recently proposed by Belloni et al. (2013) is important for making uniform inference. Research along that line under high-dimensional endogeneity is important, and we leave it for future research.

Finally, as discussed in Bickel et al. (2009) and van de Geer (2008), high-dimensional regression problems can be thought of as an approximation to a nonparametric regression problem with a “dictionary” of functions or growing number of sieves. Then in the presence of endogenous regressors, model (3.1) is closely related to the non-parametric conditional moment restricted model considered by, e.g., Newey and Powell (2003), Ai and Chen (2003), and Chen and Pouzo (2008). While the penalization in the latter literature is similar to ours, it plays a different role and is introduced for different purposes. It will be interesting to find the underlying relationships between the two models.

Supplementary Material


Acknowledgments

We would like to thank the anonymous reviewers, Associate Editor and Editor for their helpful comments that helped to improve the paper.

This project was supported by the National Science Foundation grant DMS-1206464 and the National Institute of General Medical Sciences of the National Institutes of Health through Grant Numbers R01GM100474 and R01-GM072611. The bulk of the research was carried out while Yuan Liao was a postdoctoral fellow at Princeton University.

Appendix A: Proofs for Section 2

Throughout the Appendix, C will denote a generic positive constant that may be different in different uses. Let sgn(·) denote the sign function.

A.1. Proof of Theorem 2.1

Proof. When $\hat\beta$ is a local minimizer of $Q_n(\beta)$, by the Karush-Kuhn-Tucker (KKT) condition, for every $l \le p$,

$$\frac{\partial L_n(\hat\beta)}{\partial \beta_l} + v_l = 0,$$

where $v_l = P_n'(|\hat\beta_l|)\,\mathrm{sgn}(\hat\beta_l)$ if $\hat\beta_l \neq 0$, and $v_l \in [-P_n'(0^+), P_n'(0^+)]$ if $\hat\beta_l = 0$, where we denote $P_n'(0^+) = \lim_{t\to 0^+} P_n'(t)$. By the monotonicity of $P_n'(t)$, we have $|\partial L_n(\hat\beta)/\partial\beta_l| \le P_n'(0^+)$. By a Taylor expansion and the Cauchy-Schwarz inequality, there is $\tilde\beta$ on the segment joining $\hat\beta$ and $\beta_0$ such that, on the event $\{\hat\beta_N = 0\}$ (so that $\hat\beta_j - \beta_{0j} = 0$ for all $j \notin S$),

$$\Big|\frac{\partial L_n(\hat\beta)}{\partial\beta_l} - \frac{\partial L_n(\beta_0)}{\partial\beta_l}\Big| = \Big|\sum_{j=1}^p \frac{\partial^2 L_n(\tilde\beta)}{\partial\beta_l\partial\beta_j}(\hat\beta_j - \beta_{0j})\Big| = \Big|\sum_{j\in S} \frac{\partial^2 L_n(\tilde\beta)}{\partial\beta_l\partial\beta_j}(\hat\beta_j - \beta_{0j})\Big|.$$

The Cauchy-Schwarz inequality then implies that $\max_{l\le p} |\partial L_n(\hat\beta)/\partial\beta_l - \partial L_n(\beta_0)/\partial\beta_l|$ is bounded by

$$\max_{l,j\le p}\Big|\frac{\partial^2 L_n(\tilde\beta)}{\partial\beta_l\partial\beta_j}\Big| \,\|\hat\beta_S - \beta_{0S}\|_1 \le \max_{l,j\le p}\Big|\frac{\partial^2 L_n(\tilde\beta)}{\partial\beta_l\partial\beta_j}\Big| \,\sqrt{s}\,\|\hat\beta_S - \beta_{0S}\|.$$

By our assumption, $\sqrt{s}\,\|\hat\beta_S - \beta_{0S}\| = o_p(1)$. Because $P(\hat\beta_N = 0) \to 1$,

$$\max_{l\le p}\Big|\frac{\partial L_n(\hat\beta)}{\partial\beta_l} - \frac{\partial L_n(\beta_0)}{\partial\beta_l}\Big| \stackrel{p}{\longrightarrow} 0. \quad (A.1)$$

This yields $\partial L_n(\beta_0)/\partial\beta_l = o_p(1)$.

A.2. Proof of Theorem 2.2

Proof. Let $\{X_{il}\}_{i=1}^n$ be the i.i.d. data of $X_l$, where $X_l$ is an endogenous regressor. For the penalized LS, $L_n(\beta) = \frac{1}{n}\sum_{i=1}^n (Y_i - X_i^T\beta)^2$. Under the theorem's assumptions, by the strong law of large numbers,

$$\frac{\partial}{\partial\beta_l} L_n(\beta_0) = -\frac{2}{n}\sum_{i=1}^n X_{il}(Y_i - X_i^T\beta_0) \longrightarrow -2E(X_l\varepsilon) \quad \text{almost surely},$$

which does not satisfy (2.1) of Theorem 2.1.
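To make this mechanism concrete, the following minimal sketch (an illustration we add here, not part of the paper's simulations; all numerical choices are hypothetical) simulates a single endogenous regressor and shows numerically that the least-squares score at the true coefficient vector does not vanish as the sample size grows, in line with the display above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 5
beta0 = np.array([1.0, -1.0, 0.0, 0.0, 0.0])     # sparse true coefficients

# Draw regressors; make the third regressor (an unimportant one) endogenous
# by correlating it with the regression error eps.
eps = rng.normal(size=n)
X = rng.normal(size=(n, p))
X[:, 2] = 0.8 * eps + 0.6 * rng.normal(size=n)   # Cov(X_3, eps) = 0.8 != 0

Y = X @ beta0 + eps

# Least-squares score at beta0:  dL_n/dbeta_l = -(2/n) * sum_i X_il (Y_i - X_i' beta0)
score = -2.0 * X.T @ (Y - X @ beta0) / n
print(score)   # third entry stays near -2*E(X_3 eps) = -1.6, the others near 0
```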

Appendix B: General Penalized Regressions

We present some general results for the oracle properties of penalized regressions. These results will be employed to prove the oracle properties for the proposed FGMM. Consider a penalized regression of the form:

$$\min_{\beta\in\mathbb{R}^p}\ L_n(\beta) + \sum_{j=1}^p P_n(|\beta_j|).$$

Lemma B.1

Under Assumption 4.1, if $\beta = (\beta_1,\ldots,\beta_s)^T$ is such that $\max_{j\le s}|\beta_j - \beta_{0S,j}| \le d_n$, then

$$\Big|\sum_{j=1}^s P_n(|\beta_j|) - P_n(|\beta_{0S,j}|)\Big| \le \|\beta - \beta_{0S}\|\,\sqrt{s}\,P_n'(d_n).$$

Proof. By a Taylor expansion, there exists $\beta^*$ (with $\beta_j^* \neq 0$ for each $j$) lying on the line segment joining $\beta$ and $\beta_{0S}$ such that

$$\sum_{j=1}^s\big(P_n(|\beta_j|) - P_n(|\beta_{0S,j}|)\big) = \big(P_n'(|\beta_1^*|)\mathrm{sgn}(\beta_1^*),\ldots,P_n'(|\beta_s^*|)\mathrm{sgn}(\beta_s^*)\big)^T(\beta - \beta_{0S}) \le \|\beta - \beta_{0S}\|\,\sqrt{s}\,\max_{j\le s}P_n'(|\beta_j^*|).$$

Then $\min\{|\beta_j^*| : j\le s\} \ge \min\{|\beta_{0S,j}| : j\le s\} - \max_{j\le s}|\beta_j^* - \beta_{0S,j}| \ge 2d_n - d_n = d_n$.

Since $P_n'$ is non-increasing (as $P_n$ is concave), $P_n'(|\beta_j^*|) \le P_n'(d_n)$ for all $j\le s$. Therefore $\sum_{j=1}^s\big(P_n(|\beta_j|) - P_n(|\beta_{0S,j}|)\big) \le \|\beta - \beta_{0S}\|\,\sqrt{s}\,P_n'(d_n)$.

In the theorems below, with $S = \{j : \beta_{0j}\ne 0\}$, define the so-called "oracle space" $\{\beta\in\mathbb{R}^p : \beta_j = 0 \text{ if } j\notin S\}$. Write $L_n(\beta_S, 0) = L_n(\beta)$ for $\beta = (\beta_S^T, 0)^T$. Let $\beta_S = (\beta_{S1},\ldots,\beta_{Ss})$ and

$$\nabla_S L_n(\beta_S, 0) = \Big(\frac{\partial L_n(\beta_S,0)}{\partial\beta_{S1}},\ldots,\frac{\partial L_n(\beta_S,0)}{\partial\beta_{Ss}}\Big)^T.$$

Theorem B.1 (Oracle Consistency)

Suppose Assumption 4.1 holds. In addition, suppose $L_n(\beta_S, 0)$ is twice differentiable with respect to $\beta_S$ in a neighborhood of $\beta_{0S}$ restricted on the oracle space, and there exists a positive sequence $a_n = o(d_n)$ such that:

  1. $\|\nabla_S L_n(\beta_{0S}, 0)\| = O_p(a_n)$.
  2. For any $\varepsilon > 0$, there is $C_\varepsilon > 0$ so that for all large $n$,

    $$P\big(\lambda_{\min}\big(\nabla_S^2 L_n(\beta_{0S},0)\big) > C_\varepsilon\big) > 1 - \varepsilon. \quad (B.1)$$
  3. For any $\varepsilon > 0$, $\delta > 0$, and any nonnegative sequence $\alpha_n = o(d_n)$, there is $N > 0$ such that when $n > N$,

    $$P\Big(\sup_{\|\beta_S - \beta_{0S}\|\le\alpha_n}\big\|\nabla_S^2 L_n(\beta_S,0) - \nabla_S^2 L_n(\beta_{0S},0)\big\|_F \le \delta\Big) > 1 - \varepsilon. \quad (B.2)$$

Then there exists a local minimizer $\hat\beta = (\hat\beta_S^T, 0)^T$ of

$$Q_n(\beta_S, 0) = L_n(\beta_S, 0) + \sum_{j\in S}P_n(|\beta_j|)$$

such that $\|\hat\beta_S - \beta_{0S}\| = O_p(a_n + \sqrt{s}\,P_n'(d_n))$. In addition, for an arbitrarily small $\varepsilon > 0$, the local minimizer $\hat\beta$ is strict with probability at least $1 - \varepsilon$, for all large $n$.

Proof. The proof is a generalization of the proof of Theorem 3 in Fan and Lv (2011). Let $k_n = a_n + \sqrt{s}\,P_n'(d_n)$. It is our assumption that $k_n = o(1)$. Write $Q_1(\beta_S) = Q_n(\beta_S, 0)$ and $L_1(\beta_S) = L_n(\beta_S, 0)$. In addition, write

$$\nabla L_1(\beta_S) = \frac{\partial L_n}{\partial\beta_S}(\beta_S, 0), \qquad \nabla^2 L_1(\beta_S) = \frac{\partial^2 L_n}{\partial\beta_S\partial\beta_S^T}(\beta_S, 0).$$

Define $\mathcal{N}_\tau = \{\beta\in\mathbb{R}^s : \|\beta - \beta_{0S}\| \le k_n\tau\}$ for some $\tau > 0$, and let $\partial\mathcal{N}_\tau$ denote the boundary of $\mathcal{N}_\tau$. Now define the event

$$H_n(\tau) = \Big\{Q_1(\beta_{0S}) < \min_{\beta_S\in\partial\mathcal{N}_\tau}Q_1(\beta_S)\Big\}.$$

On the event $H_n(\tau)$, by the continuity of $Q_1$, there exists a local minimizer of $Q_1$ inside $\mathcal{N}_\tau$. Equivalently, there exists a local minimizer $(\hat\beta_S^T, 0)^T$ of $Q_n$ restricted on the oracle space $\{\beta = (\beta_S^T, 0)^T\}$ with $\beta_S$ inside $\mathcal{N}_\tau$. Therefore, it suffices to show that for every $\varepsilon > 0$ there exists $\tau > 0$ so that $P(H_n(\tau)) > 1 - \varepsilon$ for all large $n$, and that the local minimizer is strict with probability arbitrarily close to one.

For any $\beta_S\in\partial\mathcal{N}_\tau$, that is, $\|\beta_S - \beta_{0S}\| = k_n\tau$, there is $\beta^*$ lying on the segment joining $\beta_S$ and $\beta_{0S}$ such that, by a Taylor expansion of $L_1(\beta_S)$,

$$Q_1(\beta_S) - Q_1(\beta_{0S}) = (\beta_S - \beta_{0S})^T\nabla L_1(\beta_{0S}) + \frac12(\beta_S - \beta_{0S})^T\nabla^2 L_1(\beta^*)(\beta_S - \beta_{0S}) + \sum_{j=1}^s\big[P_n(|\beta_{Sj}|) - P_n(|\beta_{0S,j}|)\big].$$

By Condition (i), $\|\nabla L_1(\beta_{0S})\| = O_p(a_n)$, so for any $\varepsilon > 0$ there exists $C_1 > 0$ such that the event $H_1$ satisfies $P(H_1) > 1 - \varepsilon/4$ for all large $n$, where

$$H_1 = \big\{(\beta_S - \beta_{0S})^T\nabla L_1(\beta_{0S}) \ge -C_1\|\beta_S - \beta_{0S}\|\,a_n\big\}. \quad (B.3)$$

In addition, Condition (ii) yields that there exists $C_\varepsilon > 0$ such that the following event $H_2$ satisfies $P(H_2) \ge 1 - \varepsilon/4$ for all large $n$, where

$$H_2 = \big\{(\beta_S - \beta_{0S})^T\nabla^2 L_1(\beta_{0S})(\beta_S - \beta_{0S}) > C_\varepsilon\|\beta_S - \beta_{0S}\|^2\big\}. \quad (B.4)$$

Define another event $H_3 = \{\|\nabla^2 L_1(\beta_{0S}) - \nabla^2 L_1(\beta^*)\|_F < C_\varepsilon/4\}$. Since $\|\beta_S - \beta_{0S}\| = k_n\tau$, by Condition (B.2), for any $\tau > 0$, $P(H_3) > 1 - \varepsilon/4$ for all large $n$. On the event $H_2\cap H_3$, the following event $H_4$ holds:

$$H_4 = \Big\{(\beta_S - \beta_{0S})^T\nabla^2 L_1(\beta^*)(\beta_S - \beta_{0S}) > \frac{3C_\varepsilon}{4}\|\beta_S - \beta_{0S}\|^2\Big\}.$$

By Lemma B.1, $\big|\sum_{j=1}^s[P_n(|\beta_{Sj}|) - P_n(|\beta_{0S,j}|)]\big| \le \sqrt{s}\,P_n'(d_n)\|\beta_S - \beta_{0S}\|$. Hence for any $\beta_S\in\partial\mathcal{N}_\tau$, on $H_1\cap H_4$,

$$Q_1(\beta_S) - Q_1(\beta_{0S}) \ge k_n\tau\Big(\frac{3k_n\tau C_\varepsilon}{8} - C_1a_n - \sqrt{s}\,P_n'(d_n)\Big).$$

For $k_n = a_n + \sqrt{s}\,P_n'(d_n)$, we have $C_1a_n + \sqrt{s}\,P_n'(d_n) \le (C_1 + 1)k_n$. Therefore, we can choose $\tau > 8(C_1 + 1)/(3C_\varepsilon)$ so that $Q_1(\beta_S) - Q_1(\beta_{0S}) \ge 0$ uniformly over $\beta_S\in\partial\mathcal{N}_\tau$. Thus for all large $n$, when $\tau > 8(C_1 + 1)/(3C_\varepsilon)$,

$$P(H_n(\tau)) \ge P(H_1\cap H_4) \ge 1 - \varepsilon.$$

It remains to show that the local minimizer in $\mathcal{N}_\tau$ (denoted by $\hat\beta_S$) is strict with probability arbitrarily close to one. For each $h\in\mathbb{R}\setminus\{0\}$, define

$$\psi(h) = \limsup_{\varepsilon\to 0^+}\ \sup_{\substack{t_1 < t_2 \\ (t_1,t_2)\subset(|h|-\varepsilon,\,|h|+\varepsilon)}} -\frac{P_n'(t_2) - P_n'(t_1)}{t_2 - t_1}.$$

By the concavity of $P_n(\cdot)$, $\psi(\cdot) \ge 0$. We know that $L_1$ is twice differentiable on $\mathbb{R}^s$. For $\beta_S\in\mathcal{N}_\tau$, let $A(\beta_S) = \nabla^2 L_1(\beta_S) - \mathrm{diag}\{\psi(\beta_{S1}),\ldots,\psi(\beta_{Ss})\}$. It suffices to show that $A(\hat\beta_S)$ is positive definite with probability arbitrarily close to one. On the event $H_5 = \{\eta(\hat\beta_S) \le \sup_{\beta\in B(\beta_{0S},\,cd_n)}\eta(\beta)\}$ (where $cd_n$ is as defined in Assumption 4.1(iv)),

$$\max_{j\le s}\psi(\hat\beta_{S,j}) \le \eta(\hat\beta_S) \le \sup_{\beta\in B(\beta_{0S},\,cd_n)}\eta(\beta).$$

Also define events $H_6 = \{\|\nabla^2 L_1(\hat\beta_S) - \nabla^2 L_1(\beta_{0S})\|_F < C_\varepsilon/4\}$ and $H_7 = \{\lambda_{\min}(\nabla^2 L_1(\beta_{0S})) > C_\varepsilon\}$. Then on $H_5\cap H_6\cap H_7$, for any $\alpha\in\mathbb{R}^s$ satisfying $\|\alpha\| = 1$, by Assumption 4.1(iv),

$$\alpha^TA(\hat\beta_S)\alpha \ge \alpha^T\nabla^2 L_1(\beta_{0S})\alpha - \big|\alpha^T\big(\nabla^2 L_1(\hat\beta_S) - \nabla^2 L_1(\beta_{0S})\big)\alpha\big| - \max_{j\le s}\psi(\hat\beta_{S,j}) \ge \frac{3C_\varepsilon}{4} - \sup_{\beta\in B(\beta_{0S},\,cd_n)}\eta(\beta) \ge \frac{C_\varepsilon}{4}$$

for all large $n$. This then implies $\lambda_{\min}(A(\hat\beta_S)) \ge C_\varepsilon/4$ for all large $n$.

We know that $P(\lambda_{\min}[\nabla^2 L_1(\beta_{0S})] > C_\varepsilon) > 1 - \varepsilon$. It remains to show that $P(H_5\cap H_6) > 1 - \varepsilon$ for arbitrarily small $\varepsilon$. Because $k_n = o(d_n)$, for an arbitrarily small $\varepsilon > 0$, $P(H_5) \ge P(\hat\beta_S\in B(\beta_{0S}, cd_n)) \ge 1 - \varepsilon/2$ for all large $n$. Finally,

$$P(H_6^c) \le P\big(H_6^c,\ \|\hat\beta_S - \beta_{0S}\| \le k_n\big) + P\big(\|\hat\beta_S - \beta_{0S}\| > k_n\big) \le P\Big(\sup_{\|\beta_S - \beta_{0S}\|\le k_n}\|\nabla^2 L_1(\beta_S) - \nabla^2 L_1(\beta_{0S})\|_F \ge C_\varepsilon/4\Big) + \varepsilon/4 \le \varepsilon/2.$$
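To make the local concavity measure $\psi(\cdot)$ concrete, here is a short worked example that we add for illustration, assuming the SCAD penalty of Fan and Li (2001) with tuning parameters $\lambda_n$ and $a > 2$ (the argument above does not require this particular choice):

$$P_n'(t) = \lambda_n\Big\{I(t\le\lambda_n) + \frac{(a\lambda_n - t)_+}{(a-1)\lambda_n}I(t > \lambda_n)\Big\}, \qquad t > 0.$$

On $(0,\lambda_n)$ and $(a\lambda_n,\infty)$ the derivative is locally constant, so $\psi(h) = 0$ there; on $[\lambda_n, a\lambda_n]$ it decreases linearly with slope $-1/(a-1)$, so

$$\psi(h) = \frac{1}{a-1}\,I\big(\lambda_n \le |h| \le a\lambda_n\big).$$

In particular, $\max_{j\le s}\psi(\beta_j) = 0$ whenever all nonzero components of $\beta$ exceed $a\lambda_n$ in absolute value, which is consistent with the role played by $\sup_\beta\eta(\beta)$ in the strictness argument above.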

The previous theorem assumes that the true support $S$ is known, which is not practical. We therefore need to derive conditions under which $S$ can be recovered from the data with probability approaching one. This can be done by demonstrating that the local minimizer of $Q_n$ restricted on the oracle space is also a local minimizer on $\mathbb{R}^p$. The following theorem establishes the variable selection consistency of the estimator, defined as a local solution to a penalized regression problem on $\mathbb{R}^p$.

For any $\beta\in\mathbb{R}^p$, define the projection

$$(T\beta)_j = \begin{cases}\beta_j, & j\in S,\\ 0, & j\notin S,\end{cases}\qquad j = 1,\ldots,p. \quad (B.5)$$

Theorem B.2(Variable selection)

Suppose Ln : ℝp → ℝ satisfies the conditions in Theorem B.1, and Assumption 4.1 holds. Assume the following Condition A holds:

Condition A: With probability approaching one, for $\hat\beta_S$ in Theorem B.1, there exists a neighborhood $\mathcal{B}\subset\mathbb{R}^p$ of $(\hat\beta_S^T, 0)^T$ such that for all $\beta = (\beta_S^T,\beta_N^T)^T\in\mathcal{B}$ with $\beta_N\ne 0$,

$$L_n(T\beta) - L_n(\beta) < \sum_{j\notin S}P_n(|\beta_j|). \quad (B.6)$$

Then (i) with probability approaching one, $\hat\beta = (\hat\beta_S^T, 0)^T$ is a local minimizer in $\mathbb{R}^p$ of

$$Q_n(\beta) = L_n(\beta) + \sum_{i=1}^p P_n(|\beta_i|);$$

(ii) for an arbitrarily small $\varepsilon > 0$, the local minimizer $\hat\beta$ is strict with probability at least $1 - \varepsilon$, for all large $n$.

Proof. Let $\hat\beta = (\hat\beta_S^T, 0)^T$ with $\hat\beta_S$ being the local minimizer of $Q_1(\beta_S)$ as in Theorem B.1. We now show that, with probability approaching one, there is a random neighborhood $\mathcal{B}$ of $\hat\beta$ such that for every $\beta = (\beta_S^T, \beta_N^T)^T\in\mathcal{B}$ with $\beta_N\ne 0$, we have $Q_n(\hat\beta) < Q_n(\beta)$, where the inequality is strict.

To show this, first note that we can take $\mathcal{B}$ sufficiently small so that $Q_1(\hat\beta_S) \le Q_1(\beta_S)$, because $\hat\beta_S$ is a local minimizer of $Q_1(\beta_S)$ from Theorem B.1. Recall the projection $T\beta = (\beta_S^T, 0)^T$ and that $Q_n(T\beta) = Q_1(\beta_S)$ by the definition of $Q_1$. We have $Q_n(\hat\beta) = Q_1(\hat\beta_S) \le Q_1(\beta_S) = Q_n(T\beta)$. Therefore, it suffices to show that, with probability approaching one, there is a sufficiently small neighborhood $\mathcal{B}$ of $\hat\beta$ such that for any $\beta = (\beta_S^T, \beta_N^T)^T\in\mathcal{B}$ with $\beta_N\ne 0$, $Q_n(T\beta) < Q_n(\beta)$.

In fact, this is implied by Condition (B.6):

$$Q_n(T\beta) - Q_n(\beta) = L_n(T\beta) - L_n(\beta) - \Big(\sum_{j=1}^p P_n(|\beta_j|) - \sum_{j\in S}P_n(|(T\beta)_j|)\Big) < 0. \quad (B.7)$$

The above inequality, together with the last statement of Theorem B.1, implies part (ii) of the theorem.

Appendix C: Proofs for Section 4

Throughout the proof, we write $F_{iS} = F_i(\beta_{0S})$, $H_{iS} = H_i(\beta_{0S})$ and $V_{iS} = (F_{iS}^T, H_{iS}^T)^T$.

Lemma C.1

  1. $\max_{j\le p}\big|\frac1n\sum_{i=1}^n(F_{ij} - \bar F_j)^2 - \mathrm{var}(F_j)\big| = o_p(1)$.

  2. $\max_{j\le p}\big|\frac1n\sum_{i=1}^n(H_{ij} - \bar H_j)^2 - \mathrm{var}(H_j)\big| = o_p(1)$.

  3. $\sup_{\beta\in\mathbb{R}^p}\lambda_{\max}(J(\beta)) = O_p(1)$, and $\lambda_{\min}(J(\beta_0))$ is bounded away from zero with probability approaching one.

Proof. Parts (i) and (ii) follow from an application of standard large deviation theory, using the Bernstein inequality and Bonferroni's method. Part (iii) follows from the assumption that $\mathrm{var}(F_j)$ and $\mathrm{var}(H_j)$ are bounded uniformly in $j\le p$.

C.1. Verifying conditions in Theorems B.1, B.2

C.1.1. Verifying conditions in Theorem B.1

For any $\beta\in\mathbb{R}^p$, we can write $T\beta = (\beta_S^T, 0)^T$. With slight abuse of notation, define

$$L_{FGMM}(\beta_S) = \Big[\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_S)V_{iS}\Big]^TJ(\beta_0)\Big[\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_S)V_{iS}\Big].$$

Then $L_{FGMM}(\beta_S) = L_{FGMM}(\beta_S, 0)$.
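To fix ideas, here is a minimal numerical sketch of the restricted criterion displayed above. It is our illustration rather than the authors' code: it assumes the linear model with $g(y,t) = y - t$ and takes the working instruments $F_j = X_j$ and $H_j = X_j^2$, with the diagonal weights $1/\widehat{\mathrm{var}}(F_j)$ and $1/\widehat{\mathrm{var}}(H_j)$ standing in for $J(\beta_0)$; other choices of $g$, $F$, $H$ and weights are covered by the general theory.

```python
import numpy as np

def fgmm_loss_restricted(beta_S, X_S, Y):
    """Restricted FGMM criterion L_FGMM(beta_S) for a linear model.

    Illustration only: g(y, t) = y - t, instruments F_j = X_j and H_j = X_j^2
    over the candidate support, and diagonal weights 1/var(F_j), 1/var(H_j)
    playing the role of J(beta_0).
    """
    n = X_S.shape[0]
    g = Y - X_S @ beta_S                     # residuals g(Y_i, X_iS' beta_S)
    V = np.hstack([X_S, X_S ** 2])           # instruments V_iS = (F_iS', H_iS')'
    moments = V.T @ g / n                    # (1/n) sum_i g_i * V_iS
    weights = 1.0 / V.var(axis=0)            # diagonal entries of the weight matrix
    return float(moments @ (weights * moments))

# toy usage with a hypothetical candidate support of size 2
rng = np.random.default_rng(1)
n = 500
X_S = rng.normal(size=(n, 2))
beta_true = np.array([2.0, -1.0])
Y = X_S @ beta_true + rng.normal(size=n)

print(fgmm_loss_restricted(beta_true, X_S, Y))             # small: sample moments near zero
print(fgmm_loss_restricted(np.array([0.0, 0.0]), X_S, Y))  # much larger at a wrong value
```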

Condition (i)

$\nabla L_{FGMM}(\beta_{0S}) = 2A_n(\beta_{0S})J(\beta_0)\big[\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_{0S})V_{iS}\big]$, where

$$A_n(\beta_S) \equiv \frac1n\sum_{i=1}^n m(Y_i, X_{iS}^T\beta_S)X_{iS}V_{iS}^T. \quad (C.1)$$

By Assumption 4.5, $\|A_n(\beta_{0S})\| = O_p(1)$. In addition, the elements of $J(\beta_0)$ are uniformly bounded in probability due to Lemma C.1. Hence $\|\nabla L_{FGMM}(\beta_{0S})\| \le O_p(1)\big\|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_{0S})V_{iS}\big\|$. Since $Eg(Y, X_S^T\beta_{0S})V_S = 0$, using the exponential-tail Bernstein inequality under Assumption 4.3 together with the Bonferroni inequality, it can be shown that there is $C > 0$ such that for any $t > 0$,

$$P\Big(\max_{l\le p}\Big|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_{0S})F_{li}\Big| > t\Big) \le \sum_{l\le p}P\Big(\Big|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_{0S})F_{li}\Big| > t\Big) \le \exp\big(\log p - Cnt^2\big),$$

which implies $\max_{l\le p}\big|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_{0S})F_{li}\big| = O_p(\sqrt{\log p/n})$. Similarly, $\max_{l\le p}\big|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_{0S})H_{li}\big| = O_p(\sqrt{\log p/n})$. Hence $\|\nabla L_{FGMM}(\beta_{0S})\| = O_p(\sqrt{(s\log p)/n})$.

Condition (ii)

Straightforward but tedious calculation yields

$$\nabla^2 L_{FGMM}(\beta_{0S}) = \Sigma(\beta_{0S}) + M(\beta_{0S}),$$

where $\Sigma(\beta_{0S}) = 2A_n(\beta_{0S})J(\beta_0)A_n(\beta_{0S})^T$ and $M(\beta_{0S}) = 2Z(\beta_{0S})B(\beta_{0S})$, with (writing $X_{iS} = (X_{il_1},\ldots,X_{il_s})^T$)

$$Z(\beta_{0S}) = \frac1n\sum_{i=1}^n q(Y_i, X_{iS}^T\beta_{0S})\big(X_{il_1}X_{iS},\ldots,X_{il_s}X_{iS}\big)V_{iS}^T, \qquad B(\beta_{0S}) = J(\beta_0)\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_{0S})V_{iS}.$$

It is not hard to obtain $\|B(\beta_{0S})\|_F = O_p(\sqrt{s\log p/n})$ and $\|Z(\beta_{0S})\|_F = O_p(s)$, and hence $\|M(\beta_{0S})\|_F = O_p(s\sqrt{s\log p/n}) = o_p(1)$.

Moreover, there is a constant $C > 0$ such that $P(\min_{j\in S}\widehat{\mathrm{var}}(X_j)^{-1} > C) > 1 - \varepsilon$ and $P(\min_{j\le p}\widehat{\mathrm{var}}(X_j^2)^{-1} > C) > 1 - \varepsilon$ for all large $n$ and any $\varepsilon > 0$. This then implies $P(\lambda_{\min}[J(\beta_0)] > C) > 1 - \varepsilon$. Recall Assumption 4.5, which states that $\lambda_{\min}(EA_n(\beta_{0S})EA_n(\beta_{0S})^T) > C_2$ for some $C_2 > 0$. Define the events

$$G_1 = \{\lambda_{\min}[J(\beta_0)] > C\},\quad G_2 = \{\|M(\beta_{0S})\|_F < C_2C/5\},\quad G_3 = \{\|A_n(\beta_{0S})A_n(\beta_{0S})^T - EA_n(\beta_{0S})EA_n(\beta_{0S})^T\| < C_2/5\}.$$

Then on the event $\cap_{i=1}^3 G_i$,

$$\lambda_{\min}\big[\nabla^2 L_{FGMM}(\beta_{0S})\big] \ge 2\lambda_{\min}(J(\beta_0))\lambda_{\min}\big(A_n(\beta_{0S})A_n(\beta_{0S})^T\big) - \|M(\beta_{0S})\|_F \ge 2C\big[\lambda_{\min}(EA_n(\beta_{0S})EA_n(\beta_{0S})^T) - C_2/5\big] - C_2C/5 \ge 7CC_2/5.$$

Note that $P(\cap_{i=1}^3 G_i) \ge 1 - \sum_{i=1}^3 P(G_i^c) \ge 1 - 3\varepsilon$. Hence Condition (B.1) is satisfied.

Condition (iii)

It can be shown that for any nonnegative sequence $\alpha_n = o(d_n)$, where $d_n = \min_{k\in S}|\beta_{0k}|/2$, we have

$$P\Big(\sup_{\|\beta_S - \beta_{0S}\|\le\alpha_n}\|M(\beta_S) - M(\beta_{0S})\|_F \le \delta\Big) > 1 - \varepsilon \quad (C.2)$$

for any $\varepsilon, \delta > 0$. As for $\Sigma(\beta_S)$, note that for all $\beta_S$ such that $\|\beta_S - \beta_{0S}\| < d_n/2$, we have $\beta_{S,k}\ne 0$ for all $k\le s$. Thus $J(\beta_S) = J(\beta_{0S})$. Then $P(\sup_{\|\beta_S - \beta_{0S}\|<\alpha_n}\|\Sigma(\beta_S) - \Sigma(\beta_{0S})\|_F \le \delta) > 1 - \varepsilon$ holds since $P(\sup_{\|\beta_S - \beta_{0S}\|<\alpha_n}\|A_n(\beta_S) - A_n(\beta_{0S})\|_F \le \delta) > 1 - \varepsilon$.

C.1.2. Verifying conditions in Theorem B.2

Proof. We verify Condition A of Theorem B.2; that is, with probability approaching one, there is a random neighborhood $\mathcal{B}$ of $\hat\beta = (\hat\beta_S^T, 0)^T$ such that for any $\beta = (\beta_S^T, \beta_N^T)^T\in\mathcal{B}$ with $\beta_N\ne 0$, condition (B.6) holds.

Let $F(T\beta) = \{F_l : l\in S, \beta_l\ne 0\}$ and $H(T\beta) = \{H_l : l\in S, \beta_l\ne 0\}$ for any fixed $\beta = (\beta_S^T, \beta_N^T)^T$. Define

$$\Xi(\beta) = \Big[\frac1n\sum_{i=1}^n g(Y_i, X_i^T\beta)F_i(T\beta)\Big]^TJ_1(T\beta)\Big[\frac1n\sum_{i=1}^n g(Y_i, X_i^T\beta)F_i(T\beta)\Big] + \Big[\frac1n\sum_{i=1}^n g(Y_i, X_i^T\beta)H_i(T\beta)\Big]^TJ_2(T\beta)\Big[\frac1n\sum_{i=1}^n g(Y_i, X_i^T\beta)H_i(T\beta)\Big],$$

where $J_1(T\beta)$ and $J_2(T\beta)$ are the upper-$|S|_0$ and lower-$|S|_0$ submatrices of $J(T\beta)$. Hence $L_{FGMM}(T\beta) = \Xi(T\beta)$. Then $L_{FGMM}(\beta) - \Xi(\beta)$ equals

$$\sum_{l\notin S,\,\beta_l\ne 0}\Big[w_{l1}\Big(\frac1n\sum_{i=1}^n g(Y_i, X_i^T\beta)F_{il}\Big)^2 + w_{l2}\Big(\frac1n\sum_{i=1}^n g(Y_i, X_i^T\beta)H_{il}\Big)^2\Big],$$

where $w_{l1} = 1/\widehat{\mathrm{var}}(F_l)$ and $w_{l2} = 1/\widehat{\mathrm{var}}(H_l)$. So $L_{FGMM}(\beta) \ge \Xi(\beta)$, which implies $L_{FGMM}(T\beta) - L_{FGMM}(\beta) \le \Xi(T\beta) - \Xi(\beta)$. By the mean value theorem, there exists $\lambda\in(0,1)$ such that, for $h = (\beta_S^T, \lambda\beta_N^T)^T$,

$$\Xi(T\beta) - \Xi(\beta) = \sum_{l\notin S,\,\beta_l\ne 0}\beta_l\Big[\frac1n\sum_{i=1}^n X_{il}m(Y_i, X_i^Th)F_i(T\beta)\Big]^TJ_1(T\beta)\Big[\frac1n\sum_{i=1}^n g(Y_i, X_i^Th)F_i(T\beta)\Big] + \sum_{l\notin S,\,\beta_l\ne 0}\beta_l\Big[\frac1n\sum_{i=1}^n X_{il}m(Y_i, X_i^Th)H_i(T\beta)\Big]^TJ_2(T\beta)\Big[\frac1n\sum_{i=1}^n g(Y_i, X_i^Th)H_i(T\beta)\Big] \equiv \sum_{l\notin S,\,\beta_l\ne 0}\beta_l\big(a_l(\beta) + b_l(\beta)\big).$$

Let $\mathcal{B}$ be a neighborhood of $\hat\beta = (\hat\beta_S^T, 0)^T$ (to be determined later). We have shown that $\Xi(T\beta) - \Xi(\beta) = \sum_{l\notin S,\beta_l\ne 0}\beta_l(a_l(\beta) + b_l(\beta))$ for any $\beta$, where

$$a_l(\beta) = \Big[\frac1n\sum_{i=1}^n X_{il}m(Y_i, X_i^Th)F_i(T\beta)\Big]^TJ_1(T\beta)\Big[\frac1n\sum_{i=1}^n g(Y_i, X_i^Th)F_i(T\beta)\Big],$$

and $b_l(\beta)$ is defined similarly based on $H$. Note that $h$ lies on the segment joining $\beta$ and $T\beta$ and is determined by $\beta$; hence it should be understood as a function of $\beta$. By our assumption, there is a constant $M$ such that $|m(t_1,t_2)|$ and $|q(t_1,t_2)|$, the first and second partial derivatives of $g$, and $EX_l^2F_k^2$ are all bounded by $M$ uniformly in $t_1, t_2$ and $l, k\le p$. Therefore the Cauchy-Schwarz and triangle inequalities imply

$$\Big\|\frac1n\sum_{i=1}^n X_{il}m(Y_i, X_i^Th)F_i(T\beta)\Big\|^2 \le M^2\max_{l\le p}\Big|\frac1n\sum_{i=1}^n\|X_{il}F_{iS}\|^2 - E\|X_lF_S\|^2\Big| + M^2\max_{l\le p}E\|X_lF_S\|^2.$$

Hence there is a constant $M_1$ such that, if we define the event (again, keep in mind that $h$ is determined by $\beta$)

$$B_n = \Big\{\sup_{\beta\in\mathcal{B}}\Big\|\frac1n\sum_{i=1}^n X_{il}m(Y_i, X_i^Th)F_i(T\beta)\Big\| < \sqrt{s}M_1,\ \sup_{\beta\in\mathcal{B}}\|J_1(T\beta)\| < M_1\Big\},$$

then $P(B_n)\to 1$.

then P(Bn) → 1. In addition with probability one,

1ni=1ng(Yi,XiTh)Fi(Tβ)supβ1ni=1ng(Yi,XiTβ)FiSsupβ1ni=1n[g(Yi,XiTβ)g(Yi,XiTβ^)]FiS+1ni=1ng(Yi,XiTβ^)FiSZ1+Z2,

where, β^=(β^ST,0)T. For some deterministic sequence rn (to be determined later), we can define the above to be dddd

={β:ββ^<rn/p}

then supβββ̂1 < rn. By the mean value theorem and Cauchy Schwarz inequality, there is β̃:

Z1=supβ1ni=1nm(Yi,XiTβ)FiSXiT(ββ^)ssupβ1ni=1nm(Yi,XiTβ)FiSXiTrnMsmaxkS,lp|1ni=1n(FikXil)2|1/2rn.

Hence there is a constant M2 such that P(Z1 < M2srn) → 1.

Let $\varepsilon_i = g(Y_i, X_i^T\beta_0)$. By the triangle inequality and the mean value theorem, there are $\tilde h$ and $h$ lying on the segment between $\hat\beta$ and $\beta_0$ such that

$$Z_2 \le \Big\|\frac1n\sum_{i=1}^n\varepsilon_iF_{iS}\Big\| + \Big\|\frac1n\sum_{i=1}^n m(Y_i, X_i^Th)F_{iS}X_{iS}^T(\hat\beta_S - \beta_{0S})\Big\| \le \sqrt{s}\max_{j\le p}\Big|\frac1n\sum_{i=1}^n\varepsilon_iF_{ij}\Big| + \Big\|\frac1n\sum_{i=1}^n m(Y_i, X_i^T\beta_0)F_{iS}X_{iS}^T(\hat\beta_S - \beta_{0S})\Big\| + \Big\|\frac1n\sum_{i=1}^n q(Y_i, X_i^T\tilde h)X_{iS}^T(\beta_{0S} - h_S)F_{iS}X_{iS}^T(\hat\beta_S - \beta_{0S})\Big\|$$

$$\le O_p(\sqrt{s\log p/n}) + \big(o_p(1) + \|Em(Y, X^T\beta_0)F_SX_S^T\|\big)\|\hat\beta_S - \beta_{0S}\| + \Big(\frac1n\sum_{i=1}^n q(Y_i, X_i^T\tilde h)^2\|X_{iS}\|^2\Big)^{1/2}\Big(\frac1n\sum_{i=1}^n\|X_{iS}\|^2\|F_{iS}\|^2\Big)^{1/2}\|\hat\beta_S - \beta_{0S}\|^2,$$

where we used the assumption that $\|Em(Y, X^T\beta_0)X_SF_S^T\| = O(1)$. We showed that $\|\nabla L_{FGMM}(\beta_{0S})\| = O_p(\sqrt{(s\log p)/n})$ when verifying the conditions of Theorem B.1. Hence by Theorem B.1, $\|\hat\beta_S - \beta_{0S}\| = O_p(\sqrt{s\log p/n} + \sqrt{s}\,P_n'(d_n))$. Thus

$$Z_2 = O_p\Big(\sqrt{\frac{s\log p}{n}} + \sqrt{s}\,P_n'(d_n) + s^2\sqrt{\frac{s\log s}{n}} + s^2\sqrt{s}\,P_n'(d_n)^2\Big) \equiv O_p(\xi_n).$$

By the assumption $\sqrt{s}\,\xi_n = o(P_n'(0^+))$, we have $P\big(Z_2 < P_n'(0^+)/(8\sqrt{s}M_1^2)\big)\to 1$, where $M_1$ is defined in the event $B_n$. Consequently, if we define the event $D_n = \{Z_1 < M_2\sqrt{s}\,r_n,\ Z_2 < P_n'(0^+)/(8\sqrt{s}M_1^2)\}$, then $P(B_n\cap D_n)\to 1$, and on the event $B_n\cap D_n$,

$$\sup_{\beta\in\mathcal{B}}|a_l(\beta)| \le M_1^2\sqrt{s}\big(M_2\sqrt{s}\,r_n + P_n'(0^+)/(8\sqrt{s}M_1^2)\big) = M_1^2M_2\,s\,r_n + P_n'(0^+)/8.$$

We can choose $r_n < P_n'(0^+)/(8M_1^2M_2\,s)$, so that $\sup_{\beta\in\mathcal{B}}|a_l(\beta)| \le P_n'(0^+)/4$.

On the other hand, because $(T\beta)_j = \beta_j$ for either $j\in S$ or $\beta_j = 0$, there exists $\lambda_2\in(0,1)$ such that

$$\sum_{j=1}^p\big(P_n(|\beta_j|) - P_n(|(T\beta)_j|)\big) = \sum_{j\notin S}P_n(|\beta_j|) = \sum_{l\notin S,\,\beta_l\ne 0}|\beta_l|\,P_n'(\lambda_2|\beta_l|).$$

For all $l\notin S$, $|\beta_l| \le \|\beta - \hat\beta\|_1 < r_n$. Due to the non-increasingness of $P_n'(t)$, $\sum_{l\notin S}P_n(|\beta_l|) \ge \sum_{l\notin S,\,\beta_l\ne 0}|\beta_l|\,P_n'(r_n)$. We can make $r_n$ further smaller so that $P_n'(r_n) \ge P_n'(0^+)/2$, which is satisfied, for example, when $r_n < \lambda_n$ if SCAD($\lambda_n$) is used as the penalty. Hence

$$\sum_{l\notin S}\beta_l\,a_l(\beta) \le \sum_{l\notin S}|\beta_l|\,\frac{P_n'(0^+)}{4} \le \sum_{l\notin S}|\beta_l|\,\frac{P_n'(r_n)}{2} \le \frac12\sum_{l\notin S}P_n(|\beta_l|).$$

Using the same argument we can show $\sum_{l\notin S}\beta_l\,b_l(\beta) \le \frac12\sum_{l\notin S}P_n(|\beta_l|)$. Hence $L_{FGMM}(T\beta) - L_{FGMM}(\beta) < \sum_{l\notin S,\,\beta_l\ne 0}\beta_l(a_l(\beta) + b_l(\beta)) \le \sum_{l\notin S}P_n(|\beta_l|)$ for all $\beta\in\{\beta : \|\beta - \hat\beta\|_1 < r_n\}$ on the event $B_n\cap D_n$. Here $r_n$ is such that $r_n < P_n'(0^+)/(8M_1^2M_2\,s)$ and $P_n'(r_n) \ge P_n'(0^+)/2$. This proves Condition A of Theorem B.2 since $P(B_n\cap D_n)\to 1$.

C.2. Proof of Theorem 4.1: parts (ii) (iii)

We apply Theorem B.2 to conclude that, with probability approaching one, $\hat\beta = (\hat\beta_S^T, 0)^T$ is a local minimizer of $Q_{FGMM}(\beta)$. On this event, $Q_n(\beta)$ has a local minimizer $(\hat\beta_S^T, \hat\beta_N^T)^T$ such that $\hat\beta_N = 0$. This reaches the conclusion of part (ii), and it also implies $P(\hat S\subset S)\to 1$.

By Theorem B.1 and the fact that $\|\nabla L_{FGMM}(\beta_{0S})\| = O_p(\sqrt{(s\log p)/n})$, as proved when verifying the conditions of Theorem B.1, we have $\|\beta_{0S} - \hat\beta_S\| = o_p(d_n)$. So

$$P(S\not\subset\hat S) = P(\exists j\in S,\ \hat\beta_j = 0) \le P\big(\exists j\in S,\ |\beta_{0j} - \hat\beta_j| \ge |\beta_{0j}|\big) \le P\big(\max_{j\in S}|\beta_{0j} - \hat\beta_j| \ge d_n\big) \le P\big(\|\beta_{0S} - \hat\beta_S\| \ge d_n\big) = o(1).$$

This implies $P(S\subset\hat S)\to 1$. Hence $P(\hat S = S)\to 1$.

C.3. Proof of Theorem 4.1: part (i)

Let $P_n'(|\hat\beta_S|) = \big(P_n'(|\hat\beta_{S1}|),\ldots,P_n'(|\hat\beta_{Ss}|)\big)^T$.

Lemma C.2

Under Assumption 4.1,

$$\big\|P_n'(|\hat\beta_S|)\circ\mathrm{sgn}(\hat\beta_S)\big\| = O_p\Big(\max_{\|\beta_S - \beta_{0S}\|\le d_n/4}\eta(\beta)\sqrt{s\log p/n} + \sqrt{s}\,P_n'(d_n)\Big),$$

where ∘ denotes the element-wise product.

Proof. Write $P_n'(|\hat\beta_S|)\circ\mathrm{sgn}(\hat\beta_S) = (v_1,\ldots,v_s)^T$, where $v_i = P_n'(|\hat\beta_{Si}|)\,\mathrm{sgn}(\hat\beta_{Si})$. By the triangle inequality and a Taylor expansion,

$$|v_i| \le \big|P_n'(|\hat\beta_{Si}|) - P_n'(|\beta_{0S,i}|)\big| + P_n'(|\beta_{0S,i}|) \le \eta(\beta^*)\,|\hat\beta_{Si} - \beta_{0S,i}| + P_n'(d_n),$$

where $\beta^*$ lies on the segment joining $\hat\beta_S$ and $\beta_{0S}$. For any $\varepsilon > 0$ and all large $n$,

$$P\Big(\eta(\beta^*) > \max_{\|\beta_S - \beta_{0S}\|\le d_n/4}\eta(\beta)\Big) \le P\big(\|\hat\beta_S - \beta_{0S}\| > d_n/4\big) < \varepsilon.$$

This implies $\eta(\beta^*) = O_p\big(\max_{\|\beta_S - \beta_{0S}\|\le d_n/4}\eta(\beta)\big)$. Therefore, $\|P_n'(|\hat\beta_S|)\circ\mathrm{sgn}(\hat\beta_S)\|^2 = \sum_{i=1}^s v_i^2$ is upper-bounded by

$$2\max_{\|\beta_S - \beta_{0S}\|\le d_n/4}\eta(\beta)^2\,\|\hat\beta_S - \beta_{0S}\|^2 + 2s\,P_n'(d_n)^2,$$

which implies the result since $\|\hat\beta_S - \beta_{0S}\| = O_p(\sqrt{s\log p/n} + \sqrt{s}\,P_n'(d_n))$.

Lemma C.3

Let $\Omega_n = \sqrt{n}\,\Gamma^{-1/2}$. Then for any unit vector $\alpha\in\mathbb{R}^s$,

$$\alpha^T\Omega_n\nabla L_{FGMM}(\beta_{0S})\stackrel{d}{\longrightarrow}N(0,1).$$

Proof. We have $\nabla L_{FGMM}(\beta_{0S}) = 2A_n(\beta_{0S})J(\beta_0)B_n$, where $B_n = \frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_{0S})V_{iS}$. We write $A = Em(Y, X_S^T\beta_{0S})X_SV_S^T$, $\Upsilon = \mathrm{var}(\sqrt{n}B_n) = \mathrm{var}(g(Y, X_S^T\beta_{0S})V_S)$ and $\Gamma = 4AJ(\beta_0)\Upsilon J(\beta_0)^TA^T$.

By the weak law of large numbers and the central limit theorem for i.i.d. data,

$$\|A_n(\beta_{0S}) - A\| = o_p(1), \qquad \sqrt{n}\,\tilde\alpha^T\Upsilon^{-1/2}B_n\stackrel{d}{\longrightarrow}N(0,1)$$

for any unit vector $\tilde\alpha\in\mathbb{R}^{2s}$. Hence by Slutsky's theorem,

$$\sqrt{n}\,\alpha^T\Gamma^{-1/2}\nabla L_{FGMM}(\beta_{0S})\stackrel{d}{\longrightarrow}N(0,1).$$

Proof of Theorem 4.1: part (i)

Proof. The KKT condition for $\hat\beta_S$ gives

$$P_n'(|\hat\beta_S|)\circ\mathrm{sgn}(\hat\beta_S) = -\nabla L_{FGMM}(\hat\beta_S). \quad (C.3)$$

By the mean value theorem, there exists $\beta^*$ lying on the segment joining $\beta_{0S}$ and $\hat\beta_S$ such that

$$\nabla L_{FGMM}(\hat\beta_S) = \nabla L_{FGMM}(\beta_{0S}) + \nabla^2L_{FGMM}(\beta^*)(\hat\beta_S - \beta_{0S}).$$

Let $D = \big(\nabla^2L_{FGMM}(\beta^*) - \nabla^2L_{FGMM}(\beta_{0S})\big)(\hat\beta_S - \beta_{0S})$. It then follows from (C.3) that, for $\Omega_n = \sqrt{n}\,\Gamma_n^{-1/2}$ and any unit vector $\alpha$,

$$\alpha^T\Omega_n\nabla^2L_{FGMM}(\beta_{0S})(\hat\beta_S - \beta_{0S}) = -\alpha^T\Omega_n\big[P_n'(|\hat\beta_S|)\circ\mathrm{sgn}(\hat\beta_S) + \nabla L_{FGMM}(\beta_{0S}) + D\big].$$

When verifying Condition (ii) above, we showed that $\nabla^2L_{FGMM}(\beta_{0S}) = \Sigma(\beta_{0S}) + o_p(1)$. Hence by Lemma C.3, it suffices to show that $\alpha^T\Omega_n\big[P_n'(|\hat\beta_S|)\circ\mathrm{sgn}(\hat\beta_S) + D\big] = o_p(1)$.

By Assumptions 4.5 and 4.6(i), $\lambda_{\min}(\Gamma_n)^{-1/2} = O_p(1)$. Thus $\|\alpha^T\Omega_n\| = O_p(\sqrt{n})$. Lemma C.2 then implies that $\lambda_{\max}(\Omega_n)\,\big\|P_n'(|\hat\beta_S|)\circ\mathrm{sgn}(\hat\beta_S)\big\|$ is bounded by

$$O_p(\sqrt{n})\Big(\max_{\|\beta_S - \beta_{0S}\|\le d_n/4}\eta(\beta)\sqrt{s\log p/n} + \sqrt{s}\,P_n'(d_n)\Big) = o_p(1).$$

It remains to prove $\|D\| = o_p(n^{-1/2})$, for which it suffices to show that

$$\big\|\nabla^2L_{FGMM}(\beta^*) - \nabla^2L_{FGMM}(\beta_{0S})\big\| = o_p\big((s\log p)^{-1/2}\big) \quad (C.4)$$

due to $\|\hat\beta_S - \beta_{0S}\| = O_p(\sqrt{s\log p/n} + \sqrt{s}\,P_n'(d_n))$ and the condition in Assumption 4.6 that $\sqrt{ns}\,P_n'(d_n) = o(1)$. Showing (C.4) is straightforward given the continuity of $\nabla^2L_{FGMM}$.

Appendix D: Proofs for Sections 5 and 6

The local minimizer in Theorem 4.1 is denoted by $\hat\beta = (\hat\beta_S^T, \hat\beta_N^T)^T$, and $P(\hat\beta_N = 0)\to 1$. Let $\hat\beta_G = (\hat\beta_S^T, 0)^T$.

D.1. Proof of Theorem 5.1

Lemma D.1

$L_{FGMM}(\hat\beta_G) = O_p\big(s\log p/n + s\,P_n'(d_n)^2\big)$.

Proof. We have $L_{FGMM}(\hat\beta_G) \le \big\|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\hat\beta_S)V_{iS}\big\|^2O_p(1)$. By a Taylor expansion, with some $\tilde\beta_S$ on the segment joining $\beta_{0S}$ and $\hat\beta_S$,

$$\Big\|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\hat\beta_S)V_{iS}\Big\| \le \Big\|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_{0S})V_{iS}\Big\| + \Big\|\frac1n\sum_{i=1}^n m(Y_i, X_{iS}^T\tilde\beta_S)X_{iS}V_{iS}^T\Big\|\,\|\hat\beta_S - \beta_{0S}\| \le O_p(\sqrt{s\log p/n}) + \Big\|\frac1n\sum_{i=1}^n m(Y_i, X_{iS}^T\beta_{0S})X_{iS}V_{iS}^T\Big\|\,\|\hat\beta_S - \beta_{0S}\| + \frac1n\sum_{i=1}^n\big|m(Y_i, X_{iS}^T\tilde\beta_S) - m(Y_i, X_{iS}^T\beta_{0S})\big|\,\|X_{iS}V_{iS}^T\|\,\|\hat\beta_S - \beta_{0S}\|.$$

Note that $\|Em(Y, X_S^T\beta_{0S})X_SV_S^T\|$ is bounded due to Assumption 4.5. Applying a Taylor expansion again, with some $\tilde\beta_S^*$, the above is bounded by

$$O_p(\sqrt{s\log p/n}) + O_p(1)\,\|\hat\beta_S - \beta_{0S}\| + \frac1n\sum_{i=1}^n\big|q(Y_i, X_{iS}^T\tilde\beta_S^*)\big|\,\|X_{iS}\|\,\|\tilde\beta_S - \beta_{0S}\|\,\|X_{iS}V_{iS}^T\|\,\|\hat\beta_S - \beta_{0S}\|.$$

Note that $\sup_{t_1,t_2}|q(t_1,t_2)| < \infty$ by Assumption 4.4. The last term above is bounded by $C\frac1n\sum_{i=1}^n\|X_{iS}\|\,\|X_{iS}V_{iS}^T\|\,\|\hat\beta_S - \beta_{0S}\|^2$. Combining these bounds, $\big\|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\hat\beta_S)V_{iS}\big\|$ is bounded by $O_p(\sqrt{s\log p/n} + \sqrt{s}\,P_n'(d_n)) + O_p(s\sqrt{s})\,\|\hat\beta_S - \beta_{0S}\|^2 = O_p(\sqrt{s\log p/n} + \sqrt{s}\,P_n'(d_n))$.

Lemma D.2

Under the theorem's assumptions,

$$Q_{FGMM}(\hat\beta_G) = O_p\Big(\frac{s\log p}{n} + s\,P_n'(d_n)^2 + s\max_{j\in S}P_n(|\beta_{0j}|) + P_n'(d_n)\sqrt{\frac{s\log s}{n}}\Big).$$

Proof. By the foregoing lemma, we have

$$Q_{FGMM}(\hat\beta_G) = O_p\Big(\frac{s\log p}{n} + s\,P_n'(d_n)^2\Big) + \sum_{j=1}^sP_n(|\hat\beta_{Sj}|).$$

Now, for some $\tilde\beta_{Sj}$ on the segment joining $\hat\beta_{Sj}$ and $\beta_{0j}$,

$$\sum_{j=1}^sP_n(|\hat\beta_{Sj}|) \le \sum_{j=1}^sP_n(|\beta_{0S,j}|) + \sum_{j=1}^sP_n'(|\tilde\beta_{Sj}|)\,|\hat\beta_{Sj} - \beta_{0S,j}| \le s\max_{j\in S}P_n(|\beta_{0j}|) + \sum_{j=1}^sP_n'(d_n)\,|\hat\beta_{Sj} - \beta_{0S,j}| \le s\max_{j\in S}P_n(|\beta_{0j}|) + P_n'(d_n)\,\|\hat\beta_S - \beta_{0S}\|\sqrt{s}.$$

The result then follows.

Note that for all $\delta > 0$,

$$\inf_{\beta\in\Theta_\delta\setminus\{0\}}Q_{FGMM}(\beta) \ge \inf_{\beta\in\Theta_\delta\setminus\{0\}}L_{FGMM}(\beta) \ge \inf_{\beta\in\Theta_\delta\setminus\{0\}}\Big\|\frac1n\sum_{i=1}^n g(Y_i, X_i^T\beta)V_i(\beta)\Big\|^2\min_{j\le p}\big\{\widehat{\mathrm{var}}(X_j),\widehat{\mathrm{var}}(X_j^2)\big\}.$$

Hence by Assumption 5.1, there exists $\gamma > 0$ such that

$$P\Big(\inf_{\beta\in\Theta_\delta\setminus\{0\}}Q_{FGMM}(\beta) > 2\gamma\Big)\to 1.$$

On the other hand, by Lemma D.2, $Q_{FGMM}(\hat\beta_G) = o_p(1)$. Therefore,

$$P\Big(Q_{FGMM}(\hat\beta) + \gamma > \inf_{\beta\in\Theta_\delta\setminus\{0\}}Q_{FGMM}(\beta)\Big) \le P\Big(Q_{FGMM}(\hat\beta_G) + \gamma > \inf_{\beta\in\Theta_\delta\setminus\{0\}}Q_{FGMM}(\beta)\Big) + o(1) \le P\big(Q_{FGMM}(\hat\beta_G) + \gamma > 2\gamma\big) + P\Big(\inf_{\beta\in\Theta_\delta\setminus\{0\}}Q_{FGMM}(\beta) < 2\gamma\Big) + o(1) \le P\big(Q_{FGMM}(\hat\beta_G) > \gamma\big) + o(1) = o(1).$$

Q.E.D.

D.2. Proof of Theorem 6.1

Lemma D.3

Define $\rho(\beta_S) = E\big[g(Y, X_S^T\beta_S)\sigma(W)^{-2}D(W)\big]$. Under the theorem's assumptions, $\sup_{\beta_S\in\Theta}\|\rho(\beta_S) - \rho_n(\beta_S)\| = o_p(1)$.

Proof. We first show three convergence results:

$$\sup_{\beta_S\in\Theta}\Big\|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_S)\big(D(W_i) - \hat D(W_i)\big)\hat\sigma(W_i)^{-2}\Big\| = o_p(1), \quad (D.1)$$
$$\sup_{\beta_S\in\Theta}\Big\|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_S)D(W_i)\big(\hat\sigma(W_i)^{-2} - \sigma(W_i)^{-2}\big)\Big\| = o_p(1), \quad (D.2)$$
$$\sup_{\beta_S\in\Theta}\Big\|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_S)D(W_i)\sigma(W_i)^{-2} - Eg(Y, X_S^T\beta_S)D(W)\sigma(W)^{-2}\Big\| = o_p(1). \quad (D.3)$$

Because both $\sup_w\|\hat D(w) - D(w)\|$ and $\sup_w|\hat\sigma(w)^2 - \sigma(w)^2|$ are $o_p(1)$, proving (D.1) and (D.2) is straightforward. In addition, given the assumption that $E\big(\sup_{\beta_S\in\Theta}\|g(Y, X_S^T\beta_S)\|^4\big) < \infty$, (D.3) follows from the uniform law of large numbers. Hence we have

$$\sup_{\beta_S\in\Theta}\Big\|\frac1n\sum_{i=1}^n g(Y_i, X_{iS}^T\beta_S)\hat D(W_i)\hat\sigma(W_i)^{-2} - Eg(Y, X_S^T\beta_S)D(W)\sigma(W)^{-2}\Big\| = o_p(1).$$

In addition, the event $\{\hat X_{iS} = X_{iS},\ i = 1,\ldots,n\}$ occurs with probability approaching one, given the selection consistency $P(\hat S = S)\to 1$ achieved in Theorem 4.1.

The result then follows because $\rho_n(\beta_S) = \frac1n\sum_{i=1}^n g(Y_i, \hat X_{iS}^T\beta_S)\hat\sigma(W_i)^{-2}\hat D(W_i)$.

Given Lemma D.3, Theorem 6.1 follows from a standard argument for the asymptotic normality of GMM estimators, as in Hansen (1982) and Newey and McFadden (1994, Theorem 3.4). The asymptotic variance achieves the semi-parametric efficiency bound derived by Chamberlain (1987) and Severini and Tripathi (2001). Therefore, β̂* is semi-parametrically efficient.

Appendix E: Proofs for Section 7

The proof of Theorem 7.1 is very similar to that of Theorem 4.1, which we leave to the online supplementary material, downloadable from http://terpconnect.umd.edu/∼yuanliao/high/supp.pdf

Proof of Theorem 7.2

Proof. Define $Q_{l,k} = L_K(\beta_{(k)}^{(l)}, \beta_k^{(l)}) + \sum_{j\le k}P_n(|\beta_j^{(l)}|) + \sum_{j>k}P_n(|\beta_j^{(l-1)}|)$. We first show that $Q_{l,k}\le Q_{l,k-1}$ for $1 < k\le p$ and that $Q_{l+1,1}\le Q_{l,p}$. For $1 < k\le p$, $Q_{l,k} - Q_{l,k-1}$ equals

$$L_K(\beta_{(k)}^{(l)}, \beta_k^{(l)}) + P_n(|\beta_k^{(l)}|) - \big[L_K(\beta_{(k-1)}^{(l)}, \beta_{k-1}^{(l)}) + P_n(|\beta_k^{(l-1)}|)\big].$$

Note that the difference between $(\beta_{(k)}^{(l)}, \beta_k^{(l)})$ and $(\beta_{(k-1)}^{(l)}, \beta_{k-1}^{(l)})$ lies only in the $k$th position: the $k$th position of $(\beta_{(k)}^{(l)}, \beta_k^{(l)})$ is $\beta_k^{(l)}$, while that of $(\beta_{(k-1)}^{(l)}, \beta_{k-1}^{(l)})$ is $\beta_k^{(l-1)}$. Hence, by the updating criterion, $Q_{l,k}\le Q_{l,k-1}$ for $k\le p$.

Because $(\beta_{(1)}^{(l+1)}, \beta_1^{(l+1)})$ is the first update in the $(l+1)$th iteration, $(\beta_{(1)}^{(l+1)}, \beta_1^{(l+1)}) = (\beta_{(1)}^{(l)}, \beta_1^{(l+1)})$. Hence

$$Q_{l+1,1} = L_K(\beta_{(1)}^{(l)}, \beta_1^{(l+1)}) + P_n(|\beta_1^{(l+1)}|) + \sum_{j>1}P_n(|\beta_j^{(l)}|).$$

On the other hand, since $\beta^{(l)} = (\beta_{(p)}^{(l)}, \beta_p^{(l)})$,

$$Q_{l,p} = L_K(\beta^{(l)}) + \sum_{j>1}P_n(|\beta_j^{(l)}|) + P_n(|\beta_1^{(l)}|).$$

Hence $Q_{l+1,1} - Q_{l,p} = L_K(\beta_{(1)}^{(l)}, \beta_1^{(l+1)}) + P_n(|\beta_1^{(l+1)}|) - \big[L_K(\beta^{(l)}) + P_n(|\beta_1^{(l)}|)\big]$. Note that $(\beta_{(1)}^{(l)}, \beta_1^{(l+1)})$ differs from $\beta^{(l)}$ only in the first position. By the updating criterion, $Q_{l+1,1} - Q_{l,p}\le 0$.

Therefore, if we define $\{L_m\}_{m\ge1} = \{Q_{1,1},\ldots,Q_{1,p},Q_{2,1},\ldots,Q_{2,p},\ldots\}$, then we have shown that $\{L_m\}_{m\ge1}$ is a non-increasing sequence. In addition, $L_m\ge 0$ for all $m\ge1$. Hence $\{L_m\}$ is a bounded convergent sequence, which also implies that it is Cauchy. By the definition of $Q_K(\beta^{(l)})$, we have $Q_K(\beta^{(l)}) = Q_{l,p}$, and thus $\{Q_K(\beta^{(l)})\}_{l\ge1}$ is a subsequence of $\{L_m\}$. Hence it is also bounded and Cauchy. Therefore, for any $\varepsilon > 0$ there is $N > 0$ such that for $l_1, l_2 > N$, $|Q_K(\beta^{(l_1)}) - Q_K(\beta^{(l_2)})| < \varepsilon$, which implies that the iterations stop after finitely many steps.

The rest of the proof is similar to that of the Lyapunov theorem in Lange (1995, Prop. 4). Consider a limit point $\beta^*$ of $\{\beta^{(l)}\}_{l\ge1}$, so that there is a subsequence with $\lim_{k\to\infty}\beta^{(l_k)} = \beta^*$. Because both $Q_K(\cdot)$ and $M(\cdot)$ are continuous and $Q_K(\beta^{(l)})$ is a Cauchy sequence, taking limits yields

$$Q_K(M(\beta^*)) = \lim_{k\to\infty}Q_K(M(\beta^{(l_k)})) = \lim_{k\to\infty}Q_K(\beta^{(l_k)}) = Q_K(\beta^*).$$

Hence $\beta^*$ is a stationary point of $Q_K(\beta)$.
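The monotonicity argument above can be mirrored in a few lines of code. The sketch below is our illustration, not the authors' implementation: it performs coordinate-wise updating of the kind analyzed in Theorem 7.2 on a generic smooth loss plus penalty, minimizing over one coordinate at a time by a simple grid search. The loss, penalty, and grid are placeholders chosen only to show that the penalized objective is non-increasing along the iterations and that the updates stop after finitely many sweeps.

```python
import numpy as np

def coordinate_descent(loss, penalty, beta0, grid, max_iter=50, tol=1e-8):
    """Coordinate-wise minimization of loss(beta) + sum_j penalty(|beta_j|).

    Each coordinate is updated by minimizing the objective over a finite grid
    while the other coordinates are held fixed, so the objective is
    non-increasing at every update (the property used in Theorem 7.2).
    """
    beta = beta0.copy()
    objective = lambda b: loss(b) + sum(penalty(abs(bj)) for bj in b)
    prev = objective(beta)
    for _ in range(max_iter):
        for k in range(len(beta)):
            candidates = []
            for val in grid:
                trial = beta.copy()
                trial[k] = val
                candidates.append((objective(trial), val))
            beta[k] = min(candidates)[1]       # best value for coordinate k
        curr = objective(beta)
        if prev - curr < tol:                  # stop when the decrease is negligible
            break
        prev = curr
    return beta

# toy usage: least-squares loss with a folded-concave-looking placeholder penalty
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
beta_true = np.array([1.5, 0.0, -1.0, 0.0, 0.0])
Y = X @ beta_true + 0.5 * rng.normal(size=200)

loss = lambda b: np.mean((Y - X @ b) ** 2)
penalty = lambda t: 0.1 * min(t, 0.3)          # placeholder; not the paper's penalty
grid = np.linspace(-2.0, 2.0, 81)

print(coordinate_descent(loss, penalty, np.zeros(5), grid).round(2))
```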

Footnotes

1

In fixed designs, e.g., Zhao and Yu (2006), it has been implicitly assumed that $n^{-1}\sum_{i=1}^n\varepsilon_iX_{ij} = o_p(1)$ for all $j\le p$.

2

We thank the AE and referees for suggesting the use of a general vector of instruments W, which extends the framework to a more general endogeneity problem and allows the presence of endogenous important regressors. In particular, W is allowed to be XS, which amounts to assuming that E(ε|XS) = 0 by (1.4) while allowing E(ε|X) ≠ 0. In this case, we can allow the instruments W = XS to be unknown, and F and H, defined below, can be transformations of X. This is the setup of an earlier version of this paper; it is much weaker than (1.2) and allows some components of XN to be endogenous.

3

The compatibility of (1.6) requires very stringent conditions. If $EF_SX_S^T$ and $EH_SX_S^T$ are invertible, then a necessary condition for (1.6) to have a common solution is that $(EF_SX_S^T)^{-1}E(YF_S) = (EH_SX_S^T)^{-1}E(YH_S)$, which does not hold in general when $F\ne H$.

4

For technical reasons we use a diagonal weight matrix and it is likely non-optimal. However, it does not affect the variable selection consistency in this step.

5

We thank a referee for reminding us this important research direction.

Contributor Information

Jianqing Fan, Email: jqfan@princeton.edu, Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544.

Yuan Liao, Email: yuanliao@umd.edu, Department of Mathematics, University of Maryland, College Park, MD 20742.

References

  1. Ai C, Chen X. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica. 2003;71:1795–1843. [Google Scholar]
  2. Andrews D. Consistent moment selection procedures for generalized method of moments estimation. Econometrica. 1999;67:543–564. [Google Scholar]
  3. Andrews D, Lu B. Consistent model and moment selection procedures for GMM estimation with application to dynamic panel data models. J Econometrics. 2001;101:123–164. [Google Scholar]
  4. Antoniadis A. Smoothing noisy data with tapered coiflets series. Scand J Stat. 1996;23:313–330. [Google Scholar]
  5. Belloni A, Chen D, Chernozhukov V, Hansen C. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica. 2012;80:2369–2429. [Google Scholar]
  6. Belloni A, Chernozhukov V. Least squares after model selection in high-dimensional sparse models. Bernoulli. 2013;19:521–547. [Google Scholar]
  7. Belloni A, Chernozhukov V, Hansen C. Inference on treatment effects after selection amongst high-dimensional controls. Review of Economic Studies 2013 Forthcoming. [Google Scholar]
  8. Bickel P, Klaassen C, Ritov Y, Wellner J. Efficient and adaptive estimation for semiparametric models. Springer; New York: 1998. [Google Scholar]
  9. Bickel P, Ritov Y, Tsybakov A. Simultaneous analysis of Lasso and Dantzig selector. Ann Statist. 2009;37:1705–1732. [Google Scholar]
  10. Bondell H, Reich B. Consistent high-dimensional Bayesian variable selection via penalized credible regions. J Amer Statist Assoc. 2012;107:1610–1624. doi: 10.1080/01621459.2012.716344. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Bradic J, Fan J, Wang W. Penalized composite quasi-likelihood for ultrahigh-dimensional variable selection. J R Stat Soc Ser B. 2011;73:325–349. doi: 10.1111/j.1467-9868.2010.00764.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Breheny P, Huang J. Coordinate descent algorithms for non convex penalized regression, with applications to biological feature selection. Ann Appl Statist. 2011;5:232–253. doi: 10.1214/10-AOAS388. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Bühlmann P, Kalisch M, Maathuis M. Variable selection in high-dimensional models: partially faithful distributions and the PC-simple algorithm. Biometrika. 2010;97:261–278. [Google Scholar]
  14. Bühlmann P, van de Geer S. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer; New York: 2011. [Google Scholar]
  15. Canay I, Santos A, Shaikh A. On the testability of identification in some nonparametric models with endogeneity. Econometrica. 2013;81:2535–2559. [Google Scholar]
  16. Caner M. Lasso-type GMM estimator. Econometric Theory. 2009;25:270–290. [Google Scholar]
  17. Caner M, Fan Q. Hybrid generalized empirical likelihood estimators: instrument selection with adaptive lasso. Manuscript 2012 [Google Scholar]
  18. Caner M, Zhang H. Adaptive elastic net GMM with diverging number of moments. Journal of Business and Economic Statistics. 2013 doi: 10.1080/07350015.2013.836104. forthcoming. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Candès E, Tao T. The Dantzig selector: statistical estimation when p is much larger than n. Ann Statist. 2007;35:2313–2404. [Google Scholar]
  20. Chamberlain G. Asymptotic efficiency in estimation with conditional moment restrictions. J Econometrics. 1987;34:305–334. [Google Scholar]
  21. Chen X. Large sample sieve estimation of semi-nonparametric models. In: Heckman JJ, Leamer EE, editors. Handbook of Econometrics. VI ch 76 2007. [Google Scholar]
  22. Chen X, Pouzo D. Estimation of nonparametric conditional moment models with possibly nonsmooth generalized residuals. Econometrica. 2012;80:277–321. [Google Scholar]
  23. Chernozhukov V, Hong H. An MCMC approach to classical estimation. J Econometrics. 2003;115:293–346. [Google Scholar]
  24. Daubechies I, Defrise M, De Mol C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm Pure Appl Math. 2004;57:1413–1457. [Google Scholar]
  25. Dominguez M, Lobato I. Consistent estimation of models defined by conditional moment restrictions. Econometrica. 2004;72:1601–1615. [Google Scholar]
  26. Donald S, Imbens G, Newey W. Choosing instrumental variables in conditional moment restriction models. J Econometrics. 2009;153:28–36. [Google Scholar]
  27. Engle R, Hendry D, Richard J. Exogeneity. Econometrica. 1983;51:277–304. [Google Scholar]
  28. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–1360. [Google Scholar]
  29. Fan J, Liao Y. Endogeneity in ultra high dimensions. Manuscript 2012 [Google Scholar]
  30. Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B. 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Fan J, Lv J. Non-concave penalized likelihood with NP-dimensionality. IEEE Trans Inform Theory. 2011;57:5467–5484. doi: 10.1109/TIT.2011.2158486. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Fan J, Yao Q. Efficient estimation of conditional variance functions in stochastic regression. Biometrika. 1998;85:645–660. [Google Scholar]
  33. Fu W. Penalized regression: The bridge versus the LASSO. J Comput Graph Statist. 1998;7:397–416. [Google Scholar]
  34. García E. Linear regression with a large number of weak instruments using a post-l1-penalized estimator. Manuscript 2011 [Google Scholar]
  35. Gautier E, Tsybakov A. High dimensional instrumental variables regression and confidence sets. Manuscript 2011 [Google Scholar]
  36. van de Geer S. High-dimensional generalized linear models and the lasso. Ann Statist. 2008;36:614–645. [Google Scholar]
  37. Hall P, Horowitz J. Nonparametric methods for inference in the presence of instrumental variables. Ann Statist. 2005;33:2904–2929. [Google Scholar]
  38. Hansen L. Large sample properties of generalized method of moments estimators. Econometrica. 1982;50:1029–1054. [Google Scholar]
  39. Horowitz J. A smoothed maximum score estimator for the binary response model. Econometrica. 1992;60:505–531. [Google Scholar]
  40. Huang J, Horowitz J, Ma S. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann Statist. 2008;36:587–613. [Google Scholar]
  41. Huang J, Ma S, Zhang C. Adaptive lasso for sparse high-dimensional regression models. Statistica Sinica. 2008;18:1603–1618. [Google Scholar]
  42. Hunter D, Li R. Variable selection using MM algorithms. Ann Statist. 2005;33:1617–1642. doi: 10.1214/009053605000000200. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Kim Y, Choi H, Oh H. Smoothly Clipped Absolute Deviation on High Dimensions. J Amer Statist Assoc. 2008;103:1665–1673. [Google Scholar]
  44. Lange K. A gradient algorithm locally equivalent to the EM algorithm. J Roy Statist Soc Ser B. 1995;57:425–437. [Google Scholar]
  45. Kitamura Y, Tripathi G, Ahn H. Empirical likelihood-based inference in conditional moment restriction models. Econometrica. 2004;72:1667–1714. [Google Scholar]
  46. Leeb H, Pötscher B. Sparse estimators and the oracle property, or the return of Hodges' estimator. J Econometrics. 2008;142:201–211. [Google Scholar]
  47. Liao Z. Adaptive GMM shrinkage estimation with consistent moment selection. Econometric Theory. 2013;29:857–904. [Google Scholar]
  48. Loh P, Wainwright M. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Manuscript 2013 [Google Scholar]
  49. Lv J, Fan Y. A unified approach to model selection and sparse recovery using regularized least squares. Ann Statist. 2009;37:3498–3528. [Google Scholar]
  50. Newey W. Semiparametric efficiency bound. J Appl Econometrics. 1990;5:99–125. [Google Scholar]
  51. Newey W. Efficient estimation of models with conditional moment restrictions. In: Maddala GS, Rao CR, Vinod HD, editors. Handbook of Statistics, Volume 11: Econometrics. Amsterdam; North-Holland: 1993. [Google Scholar]
  52. Newey W, McFadden D. Large sample estimation and hypothesis testing. In: Engle R, McFadden D, editors. Handbook of Econometrics. Chapter 36. 1994. [Google Scholar]
  53. Newey W, Powell J. Instrumental variable estimation of nonpara-metric models. Econometrica. 2003;71:1565–1578. [Google Scholar]
  54. Städler N, Bühlmann P, van de Geer S. l1-penalization for mixture regression models with discussion. Test. 2010;19:209–256. [Google Scholar]
  55. Severini T, Tripathi G. A simplified approach to computing efficiency bounds in semiparametric models. J Econometrics. 2001;102:23–66. [Google Scholar]
  56. Tibshirani R. Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B. 1996;58:267–288. [Google Scholar]
  57. Wasserman L, Roeder K. High-dimensional variable selection. Ann Statist. 2009;37:2178–2201. doi: 10.1214/08-aos646. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Zhang C. Nearly unbiased variable selection under minimax concave penalty. Ann Statist. 2010;38:894–942. [Google Scholar]
  59. Zhang C, Huang J. The sparsity and bias of the Lasso selection in high-dimensional linear models. Ann Statist. 2008;36:1567–1594. [Google Scholar]
  60. Zhang C, Zhang T. A general theory of concave regularization for high dimensional sparse estimation problems. Statistical Science. 2012;27:576–593. [Google Scholar]
  61. Zhao P, Yu B. On model selection consistency of Lasso. J Mach Learn Res. 2006;7:2541–2563. [Google Scholar]
  62. Zou H. The adaptive Lasso and its oracle properties. J Amer Statist Assoc. 2006;101:1418–1429. [Google Scholar]
  63. Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models. Ann Statist. 2008;36:1509–1533. doi: 10.1214/009053607000000802. [DOI] [PMC free article] [PubMed] [Google Scholar]
  64. Zou H, Zhang H. On the adaptive elastic-net with a diverging number of parameters. Ann Statist. 2009;37:1733–1751. doi: 10.1214/08-AOS625. [DOI] [PMC free article] [PubMed] [Google Scholar]
