Author manuscript; available in PMC: 2014 Jul 30.
Published in final edited form as: J Bus Econ Stat. 2013 Sep 6;32(1):30–47. doi: 10.1080/07350015.2013.836104

Adaptive Elastic Net for Generalized Methods of Moments

Mehmet Caner 1,*, Hao Helen Zhang 2
PMCID: PMC3932067  NIHMSID: NIHMS533707  PMID: 24570579

Abstract

Model selection and estimation are crucial parts of econometrics. This paper introduces a new technique that can simultaneously estimate and select the model in the generalized method of moments (GMM) context. The GMM is particularly powerful for analyzing complex data sets, such as longitudinal and panel data, and it has wide applications in econometrics. This paper extends the least squares based adaptive elastic net estimator of Zou and Zhang (2009) to nonlinear equation systems with endogenous variables. The extension is not trivial and involves a new proof technique because the estimators lack closed-form solutions. Compared to the Bridge-GMM of Caner (2009), we allow the number of parameters to diverge to infinity and allow for collinearity among a large number of variables; the redundant parameters are set to zero via a data-dependent technique. The method has the oracle property, meaning that we can estimate the nonzero parameters with their standard limit while the redundant parameters are dropped from the equations simultaneously. Numerical examples are used to illustrate the performance of the new method.

Keywords: Penalized estimators, Oracle property, Method of moments, GMM

1 Introduction

One of the most commonly used estimation techniques is Generalized Method of Moments (GMM) estimation. The GMM provides a unified framework for parameter estimation by encompassing many common estimation methods, such as ordinary least squares (OLS), maximum likelihood (MLE), and instrumental variables. We can estimate the parameters by the two-step efficient GMM of Hansen (1982). The GMM is an important tool in the econometrics, finance, accounting, and strategic planning literatures as well. In this paper we are concerned with model selection in GMM when the number of parameters diverges. These situations can arise in labor economics, international finance (see Alfaro, Kalemli-Ozcan, and Volosovych (2008)), and so on. In linear models, when some of the regressors are correlated with the errors and there is a large number of covariates, model selection tools are essential, since they can improve the finite-sample performance of the estimators.

Model selection techniques are very useful and widely used in statistics. For example, Tibshirani (1996) proposes the lasso method, Knight and Fu (2000) derive the asymptotic properties of the lasso, and Fan and Li (2001) propose the SCAD estimator. In econometrics, Knight (2008) and Caner (2009) offer the Bridge least squares and Bridge-GMM estimators, respectively. But these procedures all consider finite dimensions and do not take into account the collinearity among variables. Recently, model selection with a large number of parameters has been analyzed in least squares by Huang, Horowitz, and Ma (2008) and Zou and Zhang (2009), where the first article analyzes the Bridge estimator and the second is concerned with the adaptive elastic net estimator.

The adaptive elastic net estimator has the oracle property when the number of parameters diverges with the sample size. Furthermore, this method can handle the collinearity arising from a large number of regressors when the system is linear with endogenous regressors. When some of the parameters are redundant (i.e., when the true model has a sparse representation), this estimator can estimate the zero parameters as exactly zero.

In this paper we extend the least squares based adaptive elastic net of Zou and Zhang (2009) to GMM. The following issues are pertinent to model selection in GMM: (i) handling a large number of control variables in the structural equation of a simultaneous equation system, or a large number of parameters in a nonlinear system with endogenous and control variables; (ii) taking into account correlation among variables; (iii) achieving selection consistency and estimation efficiency simultaneously. All of these are successfully addressed in this work. In the least squares case of Zou and Zhang (2009), no explicit consistency proof is needed, since the least squares estimator has a simple closed-form solution. However, in this paper, since the GMM estimator does not have a closed-form solution, an explicit consistency proof is needed before deriving the finite sample risk bounds. This is one major contribution of this paper. Furthermore, in order to obtain the consistency proof, we have substantially extended the technique used in the consistency proof for the Bridge least squares estimator of Huang, Horowitz, and Ma (2008) to the GMM with an adaptive elastic net penalty. To derive the finite sample risk bounds, we use the mean value theorem and benefit from the consistency proof, unlike the least squares case of Zou and Zhang (2009). The nonlinear nature of the functions introduces additional difficulties. The GMM case involves partial derivatives of the sample moments which depend on parameter estimates, unlike the least squares case, where the same quantity does not depend on parameter estimates. This results in the need for a consistency proof, as mentioned above. We also extend Zou and Zhang (2009) to conditionally heteroskedastic data, and this results in the tuning parameter for the l1 norm being larger than in the least squares case. We also pinpoint ways to handle stationary time series cases. The estimator also has the oracle property: the nonzero coefficient estimates converge to a normal distribution, which is their standard limit, and furthermore the zero parameters are estimated as zero. Note that the oracle property is a pointwise criterion.

Earlier works on diverging parameters include Huber (1988), Portnoy (1984), and He and Shao (2000). In recent years, there have been a few works on penalized methods for standard linear regression with diverging parameters. Fan and Peng (2004) study the nonconcave penalized likelihood with a growing number of nuisance parameters, Lam and Fan (2007) analyze the profile likelihood ratio inference with a growing number of parameters, and Huang et al. (2008) study asymptotic properties of bridge estimators in sparse linear regression models. As far as we know, this is the first paper to estimate and select the model in GMM with a diverging number of parameters. In econometrics, sieve estimation is a natural application of shrinkage estimators. There are several articles that use sieves: Ai and Chen (2003), Newey and Powell (2003), Chen (2007), and Chen and Ludvigson (2009). In these articles, the sieve dimension is determined by trying several possibilities or is left for future work. The adaptive elastic net GMM can simultaneously determine the sieve dimension and estimate the structural parameters. For unpenalized GMM with many parameters, see Han and Phillips (2006). Liao (2011) considers the adaptive lasso with a fixed number of invalid moment conditions.

Section 2 presents the model and the new estimator. Then in Section 3 we derive the asymptotic results for the proposed estimator. Section 4 conducts simulations. Section 5 provides an asset pricing example used in Chen and Ludvigson (2009). Section 6 concludes. Appendix includes all the proofs.

2 Model

Let β be a p-dimensional parameter vector, where β ∈ B_p, a compact subset of R^p. The true value of β is β_0. We allow p to grow with the sample size n, so when n → ∞ we have p → ∞, but p/n → 0 as n → ∞. We do not attach a subscript n to the parameter space, to keep the notation simple. The population orthogonality conditions are

E[g(X_i, \beta_0)] = 0,

where the data are {X_i : i = 1, 2, ⋯, n}, g(·) is a known function, and the number of orthogonality restrictions is q, with q ≥ p. So we also allow q to grow with the sample size n, but q/n → 0 as n → ∞. From now on, we denote g(X_i, β) as g_i(β) for simplicity. Also assume that the g_i(β) are independent; we write g_i(β) rather than g_{ni}(β) just to simplify the notation.

2.1 The Estimators

We first define the estimators that we use. The estimators that we are interested in aim to answer the following questions. If we have a large number of control variables, some of which may be irrelevant (we may also have a large number of endogenous variables and control variables), in the structural equation of a simultaneous equation system, or a large number of parameters in a nonlinear system with endogenous and control variables, can we select the relevant ones and estimate the selected system simultaneously? If there is possibly some correlation among a large number of variables, can this method handle that? Is it also possible for the estimator to achieve the oracle property? The answers to all three questions are affirmative. First of all, the adaptive elastic net estimator simultaneously selects and estimates the model when there is a large number of parameters/regressors. It can also take into account the possible correlation among the variables. By achieving the oracle property, the nonzero parameters are estimated with their standard limits, and the zero ones are estimated as exactly zero. This method is computationally easy and uses data-dependent methods to set small coefficient estimates to zero. A subcase of the adaptive elastic net estimator is the adaptive lasso estimator, which can handle the first and third questions but does not handle correlation among a large number of variables.

First we introduce the notation. We use the following norms for the vector β: \|\beta\|_1 = \sum_{j=1}^p |\beta_j|, \|\beta\|_2^2 = \sum_{j=1}^p \beta_j^2, and \|\beta\|_{2+l}^{2+l} = \sum_{j=1}^p |\beta_j|^{2+l}, where l > 0 is a positive number. For a matrix A, the norm is \|A\|_2^2 = \mathrm{tr}(A'A). We start by introducing the adaptive elastic net estimator, given the positive and diverging tuning parameters \lambda_1, \lambda_1^*, \lambda_2 (how to choose them in finite samples and their asymptotic properties are discussed below in the Assumptions and then in the Simulation section):

\hat\beta_{aenet} = \Big(1 + \frac{\lambda_2}{n}\Big)\Big\{\arg\min_{\beta \in B_p}\Big[\Big(\sum_{i=1}^n g_i(\beta)\Big)' W_n \Big(\sum_{i=1}^n g_i(\beta)\Big) + \lambda_2 \|\beta\|_2^2 + \lambda_1^* \sum_{j=1}^p \hat w_j |\beta_j|\Big]\Big\},   (1)

where \hat w_j = 1/|\hat\beta_{enet,j}|^{\gamma}, \hat\beta_{enet} is a consistent estimator explained immediately below, γ is a positive constant, and p = n^α, 0 < α < 1. The assumption on γ is explained in detail in Assumption 3(iii). W_n is a q × q weight matrix that is defined in the Assumptions below.

The elastic net estimator, which is used in the weights of the penalty above, is

\hat\beta_{enet} = \Big(1 + \frac{\lambda_2}{n}\Big)\Big\{\arg\min_{\beta \in B_p} S_n(\beta)\Big\},

where

S_n(\beta) = \Big[\sum_{i=1}^n g_i(\beta)\Big]' W_n \Big[\sum_{i=1}^n g_i(\beta)\Big] + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1,   (2)

λ1, λ2 are positive and diverging sequences that will be defined in Assumption 5.

We now discuss the penalty functions in both estimators and explain why we need β^enet. The elastic net estimator has both l1 and l2 penalties. The l1 penalty is used to perform automatic variable selection, and the l2 penalty is used to improve the prediction and handles the collinearity that may arise with a large number of variables. However, the standard elastic net estimator does not provide the oracle property. It turns out that, by introducing an adaptive weight in the elastic net, we can obtain the oracle property. The adaptive weights play crucial roles, since they provide data dependent penalization.

An important point to remember is that when we set λ_2 = 0 in the adaptive elastic net estimator (1), we obtain the adaptive lasso estimator. This is simpler, and we can also get the oracle property. However, with a large number of parameters/variables which may be highly collinear, an additional ridge-type penalty, as in the adaptive elastic net, offers estimation stability and better selection. Before the assumptions we introduce the following notation. Let the collection of nonzero parameters be the set A = {j : ∣β_{j0}∣ ≠ 0}, and denote the absolute value of the minimum of the nonzero coefficients by η = min_{j∈A}∣β_{j0}∣. Also, the cardinality of A is p_A (the number of nonzero coefficients). We now provide the main assumptions.
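To fix ideas, the following sketch writes the two penalized GMM criteria in (1) and (2) as plain Python functions. This is an illustrative implementation only, and the moment function g, the weight matrix W_n, the tuning parameters, and the minimizer itself are all placeholders to be supplied by the user.

import numpy as np

def gmm_quadratic_form(beta, g, X, Wn):
    # Unpenalized GMM objective: (sum_i g_i(beta))' W_n (sum_i g_i(beta)).
    # g(X, beta) should return an n x q array of moment contributions.
    gbar = g(X, beta).sum(axis=0)            # q-vector of summed moments
    return gbar @ Wn @ gbar

def elastic_net_gmm_objective(beta, g, X, Wn, lam1, lam2):
    # S_n(beta) in (2): GMM quadratic form plus l2 and l1 penalties.
    return (gmm_quadratic_form(beta, g, X, Wn)
            + lam2 * np.sum(beta ** 2)
            + lam1 * np.sum(np.abs(beta)))

def adaptive_enet_gmm_objective(beta, g, X, Wn, lam1_star, lam2, weights):
    # Inner objective of (1): GMM quadratic form, l2 penalty, and the
    # adaptively weighted l1 penalty with weights w_j = |beta_enet_j|**(-gamma).
    return (gmm_quadratic_form(beta, g, X, Wn)
            + lam2 * np.sum(beta ** 2)
            + lam1_star * np.sum(weights * np.abs(beta)))

# The final estimator in (1) rescales the minimizer of the inner objective by (1 + lam2 / n).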

Assumptions

  1. Define the following q × p matrix: \hat G_n(\beta) = \sum_{i=1}^n \partial g_i(\beta)/\partial\beta'. Assume the following uniform law of large numbers:
    \sup_{\beta \in B_p} \Big\|\frac{\hat G_n(\beta)}{n} - G(\beta)\Big\|_2^2 \to_p 0,
    where G(β) is continuous in β and has full column rank p. Also β ∈ B_p ⊂ R^p, B_p is compact, and the absolute values of the individual components of β_0 are uniformly bounded by a constant a: ∣β_{j0}∣ ≤ a < ∞, j = 1, ⋯, p. Note that, specifically, \sup_{\beta \in B_p} \big\|n^{-1}\sum_{i=1}^n E[\partial g_i(\beta)/\partial\beta'] - G(\beta)\big\|_2^2 \to 0 defines G(β).
  2. W_n is a symmetric, positive definite matrix, and \|W_n - W\|_2^2 \to_p 0, where W is a finite and positive definite matrix as well.

  3. (i). \big\|\big[n^{-1}\sum_{i=1}^n E g_i(\beta_0) g_i(\beta_0)'\big]^{-1} - \Omega^{-1}\big\|_2^2 \to 0.

    (ii). Assume p = n^α, 0 < α < 1, p/n → 0, q/n → 0 as n → ∞, and p, q → ∞, q ≥ p, so q = n^ν, 0 < ν < 1, α ≤ ν.

    (iii). The coefficient on the weights, γ, satisfies the bound γ > (2 + α)/(1 − α).

  4. Assume that
    \max_i \frac{E\|g_i(\beta_{A,0})\|_{2+l}^{2+l}}{n^{l/2}} \to 0,
    for l > 0, where β_{A,0} represents the true values of the nonzero parameters and is of dimension p_A. This dimension also increases with the sample size; p_A → ∞ as n → ∞, and 0 ≤ p_A ≤ p.
  5. (i). λ_1/n → 0, λ_2/n → 0, λ_1^*/n → 0.

    (ii). \frac{\lambda_1^*}{n^{3+\alpha}/n^{\gamma(1-\alpha)}} \to \infty.   (3)

    (iii). n^{1-\nu}\eta^2 \to \infty.   (4)

    (iv). n^{1-\alpha}\eta^{\gamma} \to \infty.   (5)

    (v). Set η = O(n^{−m}), 0 < m < α/2.

Now we provide some discussion of the above assumptions. Most of them are standard and are used in other papers that establish asymptotic results for penalized estimation in the context of diverging parameters. The rest are typically used for the GMM to deal with nonlinear equation systems with endogenous variables. Since p → ∞, Assumption 1 can be thought of as uniform convergence over the sieve spaces B_p. For the iid subcase, primitive conditions are available in Condition 3.5M of Chen (2007).

Assumptions 1-2 are standard in the GMM literature (Newey and Windmeijer, 2009a; Chen, 2007). They are similar to Assumptions 7 and 3 in Newey and Windmeijer (2009a). It is also important to see that Assumption 2 is needed for the two-step nature of the GMM problem. In the first step we can use any consistent estimator (e.g., the elastic net) and substitute W_n = [n^{-1}\sum_{i=1}^n g_i(\hat\beta_{enet}) g_i(\hat\beta_{enet})']^{-1} into the adaptive elastic net estimation, where \hat\beta_{enet} is the elastic net estimator. Also note that with different estimators we can define the limit weight W differently; depending on the estimator W_n, W can change. Assumption 3 provides a definition of the variance-covariance matrix and then requires that the number of diverging parameters cannot exceed the sample size. This is also used in Zou and Zhang (2009). For the penalty exponent in the weight, our condition is more stringent than in the least squares case of Zou and Zhang (2009). This is needed for model selection of local to zero coefficients in the GMM setup. Assumption 4 is a primitive condition for the triangular array central limit theorem. It also restrains the number of orthogonality conditions q.

The main issue is the tuning parameter assumptions that reduce bias. We first compare with the Bridge estimator of Knight and Fu (2000): there, in Theorem 3, they need λ/n^{a/2} → λ_0 ≥ 0, where 0 < a < 1 and λ represents the only tuning parameter. In our Assumption 5(i), we need λ_1/n → 0, λ_1^*/n → 0, λ_2/n → 0, so ours can be larger than in the Knight and Fu (2000) estimator; we can penalize more in our case. This is due to the Bridge-type penalty, which requires less penalization in order to reduce bias and obtain the oracle property. Theorem 2 of Zou (2006) for the adaptive lasso in least squares assumes λ/n^{1/2} → 0. We think the reason that the GMM estimator requires larger penalization is its more complex model structure, since there are more elements that contribute to bias here. Theorem 1 of Gao and Huang (2010) displays the same tuning analysis as Zou (2006).

The rates on λ_1, λ_2 are standard, but the rate of λ_1^* depends on α and γ. The conditions on λ_1, λ_1^*, λ_2 are needed for consistency and for the bounds on the moments of the estimators. We also allow for local to zero (nonzero) coefficients, but Assumptions 3 and 5, (4)-(5), restrict their magnitude. This is tied to the possible number of nonzero coefficients and to the fact that α ≤ ν. If there are many nonzero coefficients (α near 1), then for model selection purposes the coefficients can approach zero only slowly. If there are few nonzero coefficients (α near 0, to give an example), then the order of η should be slightly larger than n^{−1/2}. This also confirms and extends the finding of Leeb and Pötscher (2005) that local to zero coefficients should be larger than n^{−1/2} in order to be differentiated from zero coefficients; this is shown in Proposition A.1(2) of their paper. Our results extend that result to the diverging parameter case. Assumption 5(iii) is needed to get consistency for local to zero coefficients. Assumptions 5(iv)-(v) are needed for model selection consistency of local to zero coefficients.

To be specific about the implications of Assumptions 5(iii)-(v) on the order of η, since η = O(n^{−m}), Assumption 5(iii) implies that

\frac{1-\nu}{2} > m,

and Assumption 5(iv) implies that

\frac{1-\alpha}{\gamma} > m.

Combining these two inequalities with Assumption 5(v), we obtain

m < \min\Big(\frac{1-\alpha}{\gamma}, \frac{1-\nu}{2}, \frac{\alpha}{2}\Big).

Now we can see that with a large number of moments, or a large number of parameters, m may get small, so the magnitude of η should be large. To give an example, take γ = 5, α = 1/3, ν = 2/3, which gives us the upper bound m < 2/15. So in that scenario η can be of order n^{−2/15} and still get selected as nonzero. It is clear that this is much larger than the n^{−1/2} found by Leeb and Pötscher (2005).

Note also that with γ > (2 + α)/(1 − α) (i.e., Assumption 3(iii)), we can assure that the conditions on λ_1^* in Assumptions 5(i) and (ii) are compatible with each other.

Using Assumptions 1, 2, and 3(i), we can see that

0 < b \le \mathrm{Eigmin}\Big[\Big(\frac{\hat G_n(\bar\beta)}{n}\Big)' W_n \Big(\frac{\hat G_n(\bar\beta)}{n}\Big)\Big],   (6)

and

\mathrm{Eigmax}\Big[\Big(\frac{\hat G_n(\bar\beta)}{n}\Big)' W_n \Big(\frac{\hat G_n(\bar\beta)}{n}\Big)\Big] \le B < \infty,   (7)

with probability approaching one, where \bar\beta \in [\beta_0, \hat\beta_w) and B > 0. These are obtained by Exercise 8.26b of Abadir and Magnus (2005) and Lemma A0 of Newey and Windmeijer (2009b) (an eigenvalue inequality for the increasing-dimension case). \hat\beta_w is related to the adaptive elastic net estimator and is defined immediately below. Here Eigmin(M) and Eigmax(M) represent the minimal and maximal eigenvalues of a generic matrix M, respectively.

3 Asymptotics

We define an estimator which is related to the adaptive elastic net estimator in (1) and also used in the risk bound calculations.

\hat\beta_w = \arg\min_{\beta \in B_p}\Big[\Big(\sum_{i=1}^n g_i(\beta)\Big)' W_n \Big(\sum_{i=1}^n g_i(\beta)\Big) + \lambda_2\|\beta\|_2^2 + \lambda_1^* \sum_{j=1}^p \hat w_j |\beta_j|\Big].   (8)

The following theorem provides consistency for both the elastic net estimator and β^w.

Theorem 1

Under Assumptions 1-3, 5

(i).

\|\hat\beta_{enet} - \beta_0\|_2^2 \to_p 0.

(ii).

\|\hat\beta_w - \beta_0\|_2^2 \to_p 0.

Remark

It is clear from Theorem 1(ii) that the adaptive elastic net estimator in (1) is also consistent. We should note that in Zou and Zhang (2009), where the least squares adaptive elastic net estimator is studied, there is no explicit consistency proof. This is due to their use of a simple linear model. However, for the GMM adaptive elastic net estimator, the partial derivative of g(·) depends on the estimators, unlike in the linear model case. Specifics are in equations (45)-(50). Therefore we need a new and different consistency proof compared with the least squares case. We also need to introduce an estimator that is closely tied to the elastic net estimator above:

\hat\beta(\lambda_2, \lambda_1) = \arg\min_{\beta \in B_p} S_n(\beta),   (9)

where S_n(β) is defined in (2). This is also the estimator we get when we set \hat w_j = 1 for all j in \hat\beta_w. Next, we provide bounds for our estimators. These are then used in the proofs of the oracle property and of the limits of the estimators.

Theorem 2

Under Assumptions 1-3, 5

E\big(\|\hat\beta_w - \beta_0\|_2^2\big) \le \frac{4\lambda_2^2\|\beta_0\|_2^2 + n^3 p B + \lambda_1^{*2} E\big(\sum_{j=1}^p \hat w_j^2\big) + o(n^2)}{[n^2 b + \lambda_2 + o(n^2)]^2},

and

E\big(\|\hat\beta(\lambda_2,\lambda_1) - \beta_0\|_2^2\big) \le \frac{4\lambda_2^2\|\beta_0\|_2^2 + n^3 p B + \lambda_1^2 p + o(n^2)}{[n^2 b + \lambda_2 + o(n^2)]^2}.

Remark

Note that the first bound is related to the estimator in (8) and the second bound is related to the estimator in (9). \hat\beta_w is related to the adaptive elastic net estimator in (1), and \hat\beta(\lambda_2,\lambda_1) is related to the estimator in (2). Even though \|\beta_0\|_2^2 = O(p) and p → ∞, the bound still converges, since \lambda_2^2\|\beta_0\|_2^2 is dominated by the n^4 term in the denominator in large samples by Assumptions 3(ii) and 5. Also \lambda_1^{*2} E(\sum_{j=1}^p \hat w_j^2) is dominated by the n^4 in the denominator in large samples, as seen in the proof of Theorem 3(i) below. It is clear from the last result that the elastic net estimator converges at the rate \sqrt{n/p}.

Theorem 2 extends the least squares case of Theorem 3(i) in Zou and Zhang (2009) to the nonlinear GMM case. The risk bounds are different from their case due to the nonlinear nature of our problem. The partial derivative of the sample moment depends on parameter estimates in our case which complicates the proofs.

Write \beta_0 = (\beta_{A,0}', 0_{p-p_A}')', where \beta_{A,0} represents the vector of nonzero parameters (true values). Its dimension grows with the sample size, and the vector 0_{p-p_A} of p − p_A zeros represents the zero (redundant) parameters. Let \beta_A represent the nonzero parameters, of dimension p_A × 1.

Then define

\tilde\beta = \arg\min_{\beta_A}\Big\{\Big[\sum_{i=1}^n g_i(\beta_A)\Big]' W_n \Big[\sum_{i=1}^n g_i(\beta_A)\Big] + \lambda_2 \sum_{j \in A}\beta_j^2 + \lambda_1^* \sum_{j \in A}\hat w_j|\beta_j|\Big\},

where A = {j : β_{j0} ≠ 0, j = 1, 2, ⋯, p}. Our next goal is to show that, with probability tending to one, ((1+\lambda_2/n)\tilde\beta', 0_{p-p_A}')' is the solution of the adaptive elastic net problem in (1).

Theorem 3

Given Assumptions 1-3, and 5,

(i). With probability tending to one, ((1+\lambda_2/n)\tilde\beta', 0')' is the solution to (1).

(ii). (Consistency in Selection) we also have

P\big(\{j : \hat\beta_{aenet,j} \ne 0\} = A\big) \to 1.

Remarks

1. Theorem 3(i) shows that the ideal estimator \tilde\beta becomes the same as the adaptive elastic net estimator in large samples. So the GMM adaptive elastic net estimator has the same solution as ((1+\lambda_2/n)\tilde\beta', 0_{p-p_A}')'. Theorem 3(ii) shows that the nonzero adaptive elastic net estimates display the oracle property together with Theorem 4 below; this is a sharper result than the one in Theorem 3(i). This is an important extension of the least squares case of Theorems 3.2 and 3.3 of Zou and Zhang (2009) to GMM estimation.

2. We allow for local to zero parameters and also provide an assumption under which they may be considered nonzero. This is Assumption 5(iii)-(iv): n^{1−ν}η² → ∞, n^{1−α}η^γ → ∞, where q = n^ν, p = n^α, 0 < α ≤ ν < 1. The implications of the assumptions for the magnitude of the smallest nonzero coefficient are discussed after the Assumptions. The proof of Theorem 3(ii) clearly shows that, as long as Assumption 5 is satisfied, model selection for local to zero coefficients is possible. However, the local to zero coefficients cannot be arbitrarily close to zero if they are to be selected. This is well established in Leeb and Pötscher (2005), who show, in their Proposition A1(2), that as long as the order of local to zero coefficients is larger than n^{−1/2} in magnitude they can be selected; this acts as a lower bound for local to zero coefficients to be selected as nonzero. Our Assumption 5 is the extension of their result to the GMM estimator with a diverging number of parameters. In the diverging parameter case, there is a tradeoff between the number of local to zero coefficients and the requirement on their order of magnitude.

Now we provide the limit law for the estimates of the nonzero parameter values (true values). Denote the adaptive elastic net estimators that correspond to nonzero true parameter values by the vector \hat\beta_{aenet,A}, which is of dimension p_A × 1. Define \hat\Omega as a consistent variance estimator for the nonzero parameters, which can be derived from the elastic net estimates. We also define \Omega^{-1} through \big\|\big[n^{-1}\sum_{i=1}^n E g_i(\beta_{A,0}) g_i(\beta_{A,0})'\big]^{-1} - \Omega^{-1}\big\|_2^2 \to 0.

Theorem 4

Under Assumptions 1-5, given W_n = \hat\Omega^{-1}, set W = \Omega^{-1}. Then

\delta' K_n \big[\hat G(\hat\beta_{aenet,A})' \hat\Omega^{-1} \hat G(\hat\beta_{aenet,A})\big]^{1/2} n^{1/2} (\hat\beta_{aenet,A} - \beta_{A,0}) \to_d N(0,1),

where K_n = \big[I_{p_A} + \lambda_2\big(\hat G(\hat\beta_{aenet,A})'\hat\Omega^{-1}\hat G(\hat\beta_{aenet,A})\big)^{-1}\big]\big/(1+\lambda_2/n) is a square matrix of dimension p_A and δ is a vector of Euclidean norm 1.

Remarks

1. First we see that

\|K_n - I_{p_A}\|_2^2 \to_p 0,

due to Assumptions 1, 2, and λ2 = o(n).

2. This theorem clearly extends Zou and Zhang (2009) from the least squares case to the GMM estimation. This result generalizes theirs to nonlinear functions of endogenous variables which are heavily used in econometrics and finance. The extension is not straightforward, since the new limit result depends on an explicit separate consistency proof unlike the least squares case of Zou and Zhang (2009). This is mainly because the partial derivative of the sample moment function depends on the parameter estimates, which is not shared by the least squares estimator. The limit that we derive also corresponds to the standard GMM limit in Hansen (1982), where the same result was derived for a fixed number of parameters with a well specified model. In this way, Theorem 4 also generalizes the result of Hansen (1982) to the direction of a large number of parameters with model selection.

3. Note that the K_n term is a ridge-regression-like term which helps to handle the collinearity among the variables.

4. Note that if we set λ_2 = 0, we obtain the limit for the adaptive lasso GMM estimator. In that case K_n = I_{p_A}, and

\delta' \big(\hat G(\hat\beta_{alasso,A})'\hat\Omega^{-1}\hat G(\hat\beta_{alasso,A})\big)^{1/2} n^{1/2}(\hat\beta_{alasso,A} - \beta_{A,0}) \to_d N(0,1).

How to choose the tuning parameters λ_1, λ_1^*, λ_2 and how to set small parameter estimates to zero in finite samples are discussed in the simulation section.

5. Instead of the Liapounov central limit theorem, we can use a central limit theorem for stationary time series data; such results already exist in Davidson (1994). Theorem 4 will then proceed as in the independent data case. When defining the GMM objective function, one uses sample moments weighted over time. We conjecture that this results in the same proofs for Theorems 1-3. This technique of weighting sample moments over time is used in Otsu (2006) and Guggenberger and Smith (2008).

6. After obtaining the adaptive elastic net GMM results, one can run the unpenalized GMM with nonzero parameters and conduct inference.

7. First, from Remark 1, we have \|K_n - I_{p_A}\|_2^2 = o_p(1). Then \|\delta\|_2 = (\delta_1^2 + \cdots + \delta_{p_A}^2)^{1/2} = 1, and δ is a p_A-vector. Then, by Assumption 1 and the consistency of the adaptive elastic net, we have \|\hat G(\hat\beta_{aenet,A})\|_2 = O_p(n^{1/2}). These provide the rate \sqrt{n/p_A} for the adaptive elastic net estimators.

4 Simulation

In this section we analyze the finite sample properties of the adaptive elastic net estimator for GMM. Namely, we evaluate its bias, the root mean squared error as well as the correct number of redundant versus relevant parameters. We have the following simultaneous equations for all i = 1, ⋯, n

y_i = x_i'\beta_0 + \epsilon_i, \quad x_i = \pi' z_i + \eta_i, \quad \epsilon_i = \rho\,\iota'\eta_i + \sqrt{1-\rho^2}\,\iota' v_i,

where the number of instruments q is set equal to the number of parameters p, x_i is a p × 1 vector, z_i is a p × 1 vector, ρ = 0.5, and π is a square matrix of dimension p. Furthermore, η_i is iid N(0, I_p), v_i is iid N(0, I_p), and ι is a p × 1 vector of ones.

The model that is estimated:

E[z_i\epsilon_i] = 0,

for all i = 1, ⋯, n.

We have two different designs for the parameter vector β_0. In the first case β_0 = {3, 3, 0, 0, 0} (Design 1), and in the second β_0 = {3, 3, 3, 3, 0} (Design 2). We have n = 100 and z_i ∼ N(0, Ω_z) for all i = 1, ⋯, n, with

\Omega_z = \begin{bmatrix} 1 & 0.5 & 0 & 0 & 0 \\ 0.5 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}.
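As an illustration, the sketch below generates one draw from this design (n = 100, p = q = 5, ρ = 0.5). The first-stage matrix is set to π = I_p purely for illustration, since the text does not report the value of π used in the simulations.

import numpy as np

def simulate_design(n=100, p=5, rho=0.5, beta0=(3, 3, 0, 0, 0), seed=0):
    # One draw from the simultaneous-equations design of Section 4.
    # pi = I_p is an illustrative choice; the paper leaves pi unspecified.
    rng = np.random.default_rng(seed)
    beta0 = np.asarray(beta0, dtype=float)

    # Instrument covariance: identity except corr(z1, z2) = 0.5.
    omega_z = np.eye(p)
    omega_z[0, 1] = omega_z[1, 0] = 0.5

    z = rng.multivariate_normal(np.zeros(p), omega_z, size=n)   # n x p instruments
    eta = rng.standard_normal((n, p))                           # first-stage errors
    v = rng.standard_normal((n, p))                             # auxiliary errors

    pi = np.eye(p)                                               # assumed first-stage matrix
    x = z @ pi + eta                                             # endogenous regressors
    # Structural error correlated with eta through iota'eta (iota = vector of ones).
    eps = rho * eta.sum(axis=1) + np.sqrt(1 - rho ** 2) * v.sum(axis=1)
    y = x @ beta0 + eps
    return y, x, z

y, x, z = simulate_design()
print(y.shape, x.shape, z.shape)   # (100,) (100, 5) (100, 5)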

So there is correlation among the z_i's, and this induces correlation among the x_i's, since the two equations are linked. In this section, we compare three methods: the GMM-BIC of Andrews and Lu (2001), the Bridge-GMM of Caner (2009), and the adaptive elastic net GMM. We use four different measures to compare them. First, we look at the percentage of correct models selected. Then we evaluate the following summary mean squared error (MSE):

E[(\hat\beta - \beta_0)' \Sigma (\hat\beta - \beta_0)],   (10)

where \Sigma = E(\epsilon_i\epsilon_i') and \hat\beta represents the estimated coefficient vector from each of the three methods. This measure is commonly used in the statistics literature; see Zou and Zhang (2009). The other two measures concern the individual coefficients. First, the bias of each individual coefficient estimate is measured. Then the root mean squared error of each coefficient is computed. We use 10,000 iterations.

Small coefficient estimates are truncated to zero via ∣\hat\beta_{Bridge,j}∣ < 2λ for the Bridge-GMM, as suggested in Caner (2009). For the adaptive elastic net, we use the modified shooting algorithm given in Appendix 2 of Zhang and Lu (2007). Least Angle Regression (LAR) is not used because it is not clear whether it is useful in the GMM context.

This modified shooting algorithm amounts to using the Kuhn-Tucker conditions for a corner solution. First, the absolute value of the partial derivative of the (unpenalized) GMM objective with respect to the parameter of interest is evaluated at zero for that parameter and at the current adaptive elastic net estimates for the rest. If this is less than \lambda_1^*/∣\hat\beta_{enet,j}∣^{4.5}, then we set that parameter to zero. We have also tried exponents slightly larger than 4.5 and observed that the results are not affected much. Note that the reason for a large γ comes from Assumption 3(iii). This is similar to the adaptive lasso case in Zhang and Lu (2007). A sketch of this zero-setting step is given below.
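The sketch assumes a routine grad_unpenalized(j, beta) is available that returns the partial derivative of the unpenalized GMM objective with respect to β_j (evaluated at the supplied β); the names and loop structure are ours, and only the comparison with λ_1^*/|β̂_enet,j|^4.5 comes from the text.

import numpy as np

def truncate_to_zero(beta_hat, beta_enet, grad_unpenalized, lam1_star, gamma=4.5):
    # Kuhn-Tucker corner-solution check: coefficient j is set to zero when the
    # partial derivative of the unpenalized GMM objective, evaluated with
    # beta_j = 0 and the other coordinates at their current estimates, is
    # smaller in absolute value than lam1_star / |beta_enet_j|**gamma.
    beta_out = np.asarray(beta_hat, dtype=float).copy()
    for j in range(beta_out.size):
        beta_try = beta_out.copy()
        beta_try[j] = 0.0
        threshold = lam1_star / abs(beta_enet[j]) ** gamma
        if abs(grad_unpenalized(j, beta_try)) < threshold:
            beta_out[j] = 0.0
    return beta_out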

The choice of the λ's in both the Bridge-GMM and the adaptive elastic net GMM is done via BIC. This is suggested by Zou, Hastie, and Tibshirani (2007) as well as by Wang, Li, and Tsai (2007). Specifically, we use the following BIC from Wang, Li, and Leng (2009). For each pair λ_s = (λ_1, λ_2) ∈ Λ,

BIC(\lambda_s) = \log(SSE) + |A| \frac{\log n}{n},

where ∣A∣ is the cardinality of the set A and SSE = [n^{-1}\sum_{i=1}^n g_i(\hat\beta)]' W_n [n^{-1}\sum_{i=1}^n g_i(\hat\beta)]. Basically, given a specific λ_s, we determine how many nonzero coefficients there are in the estimator and use this to compute the cardinality of A, and for that choice compute the SSE. The final λ_s is chosen as

\hat\lambda_s = \arg\min_{\lambda_s \in \Lambda} BIC(\lambda_s),

where Λ represents a finite number of possible values of λs.
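The tuning-parameter search is then a plain grid search over Λ. The sketch below assumes a fitting routine fit_penalized_gmm(lam_pair) returning the estimated coefficient vector, as well as a moment function g; both are placeholders, and only the BIC formula itself follows the text.

import numpy as np

def bic_value(beta_hat, g, X, Wn, n):
    # BIC(lambda_s) = log(SSE) + |A| * log(n)/n, where SSE is the GMM quadratic
    # form in the normalized sample moments and |A| the number of nonzero estimates.
    gbar = g(X, beta_hat).mean(axis=0)          # n^{-1} sum_i g_i(beta_hat)
    sse = gbar @ Wn @ gbar
    card_A = int(np.count_nonzero(beta_hat))
    return np.log(sse) + card_A * np.log(n) / n

def select_tuning(grid, fit_penalized_gmm, g, X, Wn, n):
    # Pick the tuning pair in the finite grid Lambda that minimizes BIC.
    best = None
    for lam_pair in grid:
        beta_hat = fit_penalized_gmm(lam_pair)
        crit = bic_value(beta_hat, g, X, Wn, n)
        if best is None or crit < best[0]:
            best = (crit, lam_pair, beta_hat)
    return best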

The Bridge-GMM estimator in Caner (2009) is β^ that minimizes Un(β), where

U_n(\beta) = \Big[\sum_{i=1}^n g_i(\beta)\Big]' W_n \Big[\sum_{i=1}^n g_i(\beta)\Big] + \lambda \sum_{j=1}^p |\beta_j|^{\gamma},   (11)

for a given positive regularization parameter λ and 0 < γ < 1.

We now describe model selection by the GMM-BIC proposed in Andrews and Lu (2001). Let b ∈ R^p denote a model selection vector. By definition, each element of b is either zero or one. If the jth element of b is one, the corresponding β_j is to be estimated; if the jth element of b is zero, we set β_j to zero. We let ∣b∣ be the number of parameters to be estimated, or equivalently ∣b∣ = \sum_{j=1}^p b_j. We then set β_{[b]} as the p × 1 vector representing the element-by-element (Hadamard) product of β and b. Model selection is based on the GMM objective function and a penalty term. The objective function used in the BIC is

J_n(b) = \Big[\sum_{i=1}^n g_i(\beta_{[b]})\Big]' W_n \Big[\sum_{i=1}^n g_i(\beta_{[b]})\Big],   (12)

where in the simulation

g_i(\beta_{[b]}) = z_i (y_i - x_i'\beta_{[b]}).

The model selection vectors “b” in our case represent 31 different possibilities (excluding the all-zero case). The following are the possibilities for all “b” vectors

M=[M11,M12,M13,M14,M15],

where M11 is the identity matrix of dimension 5, I5, which represents all the possibilities with only one nonzero coefficient. M12 represents all the possibilities with two nonzero coefficients,

M_{12} = \begin{pmatrix} 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 1 \end{pmatrix}.   (13)

In the same way, M13 represents all possibilities with 3 nonzero coefficients, M14 represents all the possibilities with four nonzero coefficients, and M15 is the vector of ones, showing all nonzero coefficients. The true model in Design 1 is the first column vector in M12. For design 2, the true model is in M14 and that is (1, 1, 1, 1, 0)’.

The GMM-BIC selects the model by minimizing the following criterion over the 31 possibilities:

J_n(b) + |b| \log(n).   (14)
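For p = 5 this search is easy to code by brute force; the sketch below enumerates the 31 nonzero selection vectors and evaluates the criterion (14), with the restricted GMM fit left as a placeholder fit_gmm_on_support.

import numpy as np
from itertools import product

def gmm_bic_select(y, x, z, Wn, fit_gmm_on_support):
    # Andrews-Lu GMM-BIC: minimize J_n(b) + |b| log(n) over all nonzero b in {0,1}^p.
    # fit_gmm_on_support(y, x, z, support) should return the GMM estimate of beta
    # with the coefficients outside `support` fixed at zero.
    n, p = x.shape
    best = None
    for b in product([0, 1], repeat=p):
        if sum(b) == 0:
            continue                              # exclude the all-zero model
        support = np.array(b, dtype=bool)
        beta_b = fit_gmm_on_support(y, x, z, support)
        g_sum = z.T @ (y - x @ beta_b)            # sum_i z_i (y_i - x_i' beta_[b])
        crit = g_sum @ Wn @ g_sum + sum(b) * np.log(n)
        if best is None or crit < best[0]:
            best = (crit, support, beta_b)
    return best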

The penalty term penalizes larger models more. Denote the optimal model selection vector by b*. After selecting the optimal model in (14), the vector b*, we then estimate the model parameters by GMM. Next we present the results in Tables 1-4 for the three techniques examined in this simulation section. In Table 1, we provide the correct model selection percentages for Designs 1 and 2. We see that both the Bridge and the Adaptive Elastic Net do very well. The Bridge-GMM selects the correct model 100% of the time and the Adaptive Elastic Net 91-95% of the time, whereas the GMM-BIC selects it only 0-6.9% of the time. This is due to the large number of possibilities in the case of GMM-BIC; with a large number of parameters the performance of GMM-BIC tends to deteriorate. Table 2 provides the summary MSE results. These clearly show that the Adaptive Elastic Net estimator is the best of the three, since its MSE figures are the smallest. The GMM-BIC is much worse in terms of MSE, due to its wrong model selection and, after the model selection, its estimating the zero coefficients with nonzero and large magnitudes. Tables 3 and 4 provide the bias and root mean squared error (RMSE) of each coefficient in Designs 1 and 2. Comparing the Bridge with the Adaptive Elastic Net, we observe that the biases of the nonzero coefficients are generally smaller for the Adaptive Elastic Net. The same generally holds for the root mean squared errors, which are smaller for the nonzero coefficients under the Adaptive Elastic Net estimator.

Table 1.

Success Percentages of Selecting the Correct Model

Estimators Design 1 Design 2

Adaptive Elastic Net 91.2 94.9
Bridge-GMM 100.0 100.0
GMM-BIC 6.9 0.0

Note: The GMM-BIC (Andrews and Lu, 2001) represents the models that are selected according to BIC and subsequently we use GMM. The Bridge-GMM estimator is studied in Caner (2009). The Adaptive Elastic Net estimator is the new procedure proposed in this study.

Table 4.

Bias, RMSE of Design 2

Adaptive Elastic Net Bridge-GMM GMM-BIC

BIAS RMSE BIAS RMSE BIAS RMSE

β 1 −0.181 0.193 −0.112 0.124 −0.805 158.171
β 2 −0.181 0.193 −0.662 0.665 −0.326 112.970
β 3 0.010 0.061 0.157 0.166 −0.314 120.358
β 4 −0.038 0.071 0.337 0.341 −6.759 659.673
β 5 −0.001 0.007 0.000 0.000 7.740 617.509

Note: The GMM-BIC (Andrews and Lu, 2001) represents models that are selected according to BIC and subsequently we use GMM. The Bridge-GMM estimator is studied in Caner (2009). The Adaptive Elastic Net estimator is the new procedure proposed in this study.

Table 2.

Summary Mean Squared Error (MSE)

Estimators Design 1 Design 2

Adaptive Elastic Net 1.8 1.3
Bridge-GMM 4.2 1.3
GMM-BIC 165848.5 876080.2

Note: The MSE formula is given in (10). Instead of expectations, the average of iterations are used. A small number for summary MSE is desirable for a model. The GMM-BIC (Andrews and Lu, 2001) represents models that are selected according to BIC and subsequently we use GMM. The Bridge-GMM estimator is studied in Caner (2009). The Adaptive Elastic Net is the new procedure proposed in this study.

Table 3.

Bias and RMSE Results of Design 1

Adaptive Elastic Net Bridge-GMM GMM-BIC

BIAS RMSE BIAS RMSE BIAS RMSE

β 1 −0.244 0.272 −0.117 0.126 2.903 159.85
β 2 −0.244 0.272 −0.667 0.669 −4.082 261.32
β 3 0.013 0.042 0.000 0.000 −0.859 158.839
β 4 0.000 0.009 0.000 0.000 0.612 188.510
β 5 0.013 0.041 0.000 0.000 1.162 62.240

Note: The GMM-BIC (Andrews and Lu, 2001) represents models that are selected according to BIC and subsequently we use GMM. The Bridge-GMM estimator is studied in Caner (2009). The Adaptive Elastic Net estimator is the new procedure proposed in this study.

To get confidence intervals for the nonzero parameters, one can run the adaptive elastic net first and find the zero and nonzero coefficients. Then, for those nonzero estimates, we have the standard GMM standard errors by Theorem 4. Using these, we can calculate confidence intervals for the nonzero coefficients.
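A sketch of this two-step procedure for the standard errors, using the usual efficient-GMM variance formula (G'Ω^{-1}G)^{-1}/n implied by Theorem 4 with K_n ≈ I; here G_hat is the Jacobian of the averaged sample moments for the selected coefficients, Omega_hat is the estimated moment variance, and the scaling convention is our own assumption.

import numpy as np
from scipy import stats

def post_selection_ci(beta_hat_A, G_hat, Omega_hat, n, level=0.95):
    # beta_hat_A : estimates of the selected (nonzero) coefficients
    # G_hat      : q x p_A Jacobian of n^{-1} sum_i g_i(beta) at beta_hat_A
    # Omega_hat  : q x q estimate of Var(g_i(beta))
    Omega_inv = np.linalg.inv(Omega_hat)
    avar = np.linalg.inv(G_hat.T @ Omega_inv @ G_hat)   # asymptotic variance of sqrt(n)(beta_hat - beta)
    se = np.sqrt(np.diag(avar) / n)
    zcrit = stats.norm.ppf(0.5 + level / 2)
    return beta_hat_A - zcrit * se, beta_hat_A + zcrit * se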

5 Application

In this part we go through a useful application of the new estimator. The following is the external habit specification model considered by Chen and Ludvigson (2009) (also equation (2.7) in Chen (2007)):

E\Big[\Big(\iota_0\Big(\frac{C_{t+1}}{C_t}\Big)^{-\phi_0}\frac{\big(1-h_0(C_t/C_{t+1})\big)^{-\phi_0}}{\big(1-h_0(C_{t-1}/C_t)\big)^{-\phi_0}}R_{l,t+1}-1\Big)\Big|\, z_t\Big]=0,

where C_t represents consumption at time t, and ι_0 and ϕ_0 are both positive and represent the time discount factor and the curvature of the utility function, respectively. R_{l,t+1} is the lth asset return at time t + 1, h_0(·) ∈ [0, 1) is an unknown habit formation function, and z_t is the information set, which will be linked to valid instruments. We take only one lag of the consumption ratio, rather than several of them. The possibility of this specific model is mentioned on p. 1069 of Chen and Ludvigson (2009). Chen and Ludvigson (2009) use sieve estimation to estimate the unknown function h_0. They set the dimension of the sieve to a given number. In this paper we use the adaptive elastic net GMM to automatically select the dimension of the sieve and estimate the structural parameters at the same time. The parameters and the unknown habit function that we try to estimate are ι_0, ϕ_0, and h_0. Now denote

\rho(C_t, R_{l,t+1}, \iota_0, \phi_0, h_0) = \iota_0\Big(\frac{C_{t+1}}{C_t}\Big)^{-\phi_0}\frac{\big(1-h_0(C_t/C_{t+1})\big)^{-\phi_0}}{\big(1-h_0(C_{t-1}/C_t)\big)^{-\phi_0}}R_{l,t+1}-1.

Before setting up the orthogonality restrictions, let {s_{0j}(z_t)} be a sequence of known basis functions that can approximate any square-integrable function. Then, for each l = 1, ⋯, N and j = 1, ⋯, J_T, the restrictions are

E[\rho(C_t, R_{l,t+1}, \iota_0, \phi_0, h_0)\, s_{0j}(z_t)] = 0.

In total we have NJ_T restrictions, where N is fixed but J_T → ∞ as T → ∞, and NJ_T/T → 0 as T → ∞. The main issue is the approximation of the unknown function h_0. Chen and Ludvigson (2009) use sieves to approximate that function. In theory the sieve dimension K_T → ∞, but K_T/T → 0, as T → ∞. As in Chen and Ludvigson (2009), we use an artificial neural network sieve approximation,

h\Big(\frac{C_{t-1}}{C_t}\Big) = \zeta_0 + \sum_{j=1}^{K_T}\zeta_j\,\Psi\Big(\tau_j\frac{C_{t-1}}{C_t}+\kappa_j\Big),

where Ψ(·) is an activation function, chosen here as the logistic function Ψ(x) = (1 + e^{−x})^{−1}. This implies that in order to estimate the habit function we need 3K_T + 1 parameters: ζ_0, ζ_j, τ_j, κ_j, j = 1, ⋯, K_T. Chen and Ludvigson (2009) use K_T = 3. In our paper, along with the parameters ι_0, ϕ_0, the estimation of h_0 is carried out through selection of the correct sieve dimension. So if the true sieve dimension is K_{T0}, with 0 ≤ K_{T0} ≤ K_T, then the adaptive elastic net GMM aims to estimate that dimension (through estimation of the parameters in the habit function). The total number of parameters to be estimated is p = 3(K_T + 1), since we also estimate ι_0, ϕ_0 in addition to the habit function parameters. The number of orthogonality restrictions is q = NJ_T, and we assume q = NJ_T ≥ 3(K_T + 1) = p. Chen and Ludvigson (2009) use a sieve minimum distance estimator, and Chen (2007) uses a sieve GMM estimator to estimate the parameters. Specifically, equation (2.16) in Chen (2007) uses unpenalized sieve GMM estimation for this problem. Instead, we will assume that the true dimension of the sieve is unknown and estimate the structural parameters along with the habit function (the parameters in that function) using the adaptive elastic net GMM estimator. Set β = (ι, ϕ, h), with the compact sieve space B_p = B_ι × B_ϕ × H_T. The compactness assumption is discussed on p. 1067 of Chen and Ludvigson (2009), and is mainly needed so that the sieve parameters do not generate tail observations in Ψ(·). Also set the approximating known basis functions as s(z_t) = (s_{0,1}(z_t), ⋯, s_{0,J_T}(z_t))', which is a J_T × 1 vector. Here J_T = 3, so there are three instruments: a constant, lagged consumption growth, and its square. There are seven asset returns used in the study, so N = 7. Detailed explanations can be found in Chen and Ludvigson (2009).
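The neural network sieve is straightforward to evaluate numerically; the sketch below codes h(·) with the logistic activation from the text (the example parameter values are made up purely for illustration).

import numpy as np

def logistic(x):
    # Activation Psi(x) = (1 + exp(-x))^{-1}.
    return 1.0 / (1.0 + np.exp(-x))

def habit_sieve(c_ratio, zeta0, zeta, tau, kappa):
    # Neural-network sieve approximation
    #   h(c) = zeta0 + sum_j zeta_j * Psi(tau_j * c + kappa_j),
    # where c is the lagged consumption ratio C_{t-1}/C_t and
    # zeta, tau, kappa are length-K_T arrays of sieve parameters.
    zeta, tau, kappa = map(np.asarray, (zeta, tau, kappa))
    c = np.asarray(c_ratio, dtype=float)[..., None]
    return zeta0 + np.sum(zeta * logistic(tau * c + kappa), axis=-1)

# Example with K_T = 3 sieve terms (illustrative values only):
h_val = habit_sieve(0.99, zeta0=0.0, zeta=[0.0, 0.0, 0.0],
                    tau=[0.06, 0.05, 0.06], kappa=[0.07, 0.07, 0.06])
print(h_val)   # 0.0 when all zeta_j = 0, matching an estimated habit of zero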

Implementation Details

1. First, we run the elastic net GMM to obtain the adaptive weights wj’s. The elastic net GMM has the same objective function as the adaptive version but with wj = 1 for all j. The enet-GMM estimator is obtained by setting the weights as 1 in the estimator in the third step given as follows.

2. Then, for the weights, since it is known a priori that the nonzero coefficients cannot be large positive numbers, we use γ = 2.5 in the exponent. For specifying the weights, \hat w_j = 1/∣\hat\beta_{enet,j}∣^{2.5} is chosen for all j. We have also experimented with γ = 4.5 as in the simulations, and the results were mildly different but qualitatively very similar, so those are not reported.

3. Our adaptive elastic net GMM estimator is:

\hat\beta = \Big(1+\frac{\lambda_2}{T}\Big)\arg\min_{\beta\in B_p}\Big\{\Big[\sum_{t=1}^T \rho(C_t,R_{t+1},\beta)\otimes s(z_t)\Big]'\hat W\Big[\sum_{t=1}^T \rho(C_t,R_{t+1},\beta)\otimes s(z_t)\Big] + \lambda_1^*\sum_{j=1}^{3(K_T+1)}\hat w_j|\beta_j| + \lambda_2\sum_{j=1}^{3(K_T+1)}\beta_j^2\Big\},

where β1 = ι, β2 = ϕ, and the remaining 3KT + 1 parameters correspond to the habit function estimation by sieves. We use the following weight to make the comparison with Chen and Ludvigson (2009) in a fair way:

\hat W = I_N \otimes (S'S)^{-},

where S = (s(z_1), ⋯, s(z_t), ⋯, s(z_T))', which is a T × 3 matrix, and (S'S)^{−} is the Moore-Penrose inverse, as described in (2.16) of Chen (2007). Note that ρ(C_t, R_{t+1}, β) is an N × 1 vector, and

\rho(C_t,R_{t+1},\beta) = \big(\rho(C_t,R_{1,t+1},\beta), \cdots, \rho(C_t,R_{l,t+1},\beta), \cdots, \rho(C_t,R_{N,t+1},\beta)\big)'.
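A sketch of how the stacked moments and the weight matrix Ŵ can be assembled for this application; rho_mat (the T × N matrix of pricing errors) and S (the T × J_T instrument matrix) are assumed to be computed elsewhere.

import numpy as np

def stacked_moments(rho_mat, S):
    # Stack the N*J_T sample moments sum_t rho_l(t) * s_j(z_t).
    #   rho_mat : T x N array of pricing errors rho(C_t, R_{l,t+1}, beta)
    #   S       : T x J_T matrix of instrument basis functions s(z_t)
    # Returns an (N*J_T,)-vector of summed moments (Kronecker ordering).
    return (rho_mat[:, :, None] * S[:, None, :]).sum(axis=0).ravel()

def weight_matrix(S, N):
    # W_hat = I_N kron (S'S)^-, with a Moore-Penrose inverse as in Chen (2007, eq. 2.16).
    return np.kron(np.eye(N), np.linalg.pinv(S.T @ S))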

Having described the implementation steps, we now explain the data. The data start in the second quarter of 1953 and end in the second quarter of 2001. This is a slightly shorter span than in Chen and Ludvigson (2009), since we did not want to use missing data cells for certain variables. So NJ_T = 21 (the number of orthogonality restrictions). At first we try K_T = 3 as in Chen and Ludvigson (2009), but here this is the maximal sieve dimension, and unlike Chen and Ludvigson (2009) we estimate the sieve parameters together with the structural parameters and select the model. We also try K_T = 5. When K_T = 3, the total number of parameters to be estimated is 12, and when K_T = 5 this number is 18. We also use BIC to choose from three possible tuning parameter values, λ_1 = λ_1^* = λ_2 ∈ {0.000001, 1, 10}. The tuning parameters take the same value for ease of computation. So here we compare our results with the unpenalized sieve GMM of Chen (2007). As discussed above, we use a subset of the instruments in Chen and Ludvigson (2009), since the remainder are unavailable, and we do not use missing data. So our results corresponding to the unpenalized sieve GMM will be slightly different from those of Chen and Ludvigson (2009).

We provide the estimates for our adaptive elastic net GMM method, first for the case K_T = 5. The time discount estimate is \hat\iota = 0.88 and the curvature parameter of the utility function is \hat\phi = 0.66. The sieve parameter estimates are \hat\zeta_0 = 0; \hat\zeta_j = 0, 0, 0, 0, 0; \hat\tau_j = 0.086, 0.078, 0.084, 0.073, 0.083; and \hat\kappa_j = 0.082, 0.072, 0.087, 0.086, 0.080, for j = 1, 2, 3, 4, 5, respectively. For comparison, if we use the sieve GMM imposing K_T = 5 as the true sieve dimension, we get \hat\iota_{sg} = 0.86 and \hat\phi_{sg} = 0.73 for the time discount and curvature parameters, where the subscript sg denotes the Chen and Ludvigson (2009)/Chen (2007) estimator with λ_1 = λ_1^* = λ_2 = 0. So the results are basically the same as ours for these two parameters. The estimates of the sieve parameters in the unpenalized sieve GMM case are \hat\zeta_{sg,0} = 0, \hat\zeta_{sg,j} = 0 for all j = 1, ⋯, 5, and \hat\tau_{sg,j} = 0.083, 0.079, 0.079, 0.079, 0.076 and \hat\kappa_{sg,j} = 0.089, 0.086, 0.086, 0.084, 0.082, respectively, for j = 1, ⋯, 5. So the habit function in the sieve GMM with K_T = 5 is estimated as zero (on the boundary); our method gives the same result.

Chen and Ludvigson (2009) fit K_T = 3 for the sieve. We provide the estimates from our adaptive elastic net GMM method in this case as well, and we also reestimate Chen and Ludvigson (2009). In the adaptive elastic net GMM, the time discount estimate is \hat\iota = 0.93 and the curvature parameter is \hat\phi = 0.64. The sieve parameter estimates are \hat\zeta_0 = 0; \hat\zeta_j = 0 for j = 1, 2, 3; \hat\tau_j = 0.057, 0.054, 0.064; and \hat\kappa_j = 0.067, 0.066, 0.058, for j = 1, 2, 3, respectively. To compare with our method, we use the sieve GMM of Chen and Ludvigson (2009) imposing K_T = 3 as the true sieve dimension and get \hat\iota_{sg} = 0.94 and \hat\phi_{sg} = 0.71 for the time discount and curvature parameters, where the subscript sg again denotes the Chen and Ludvigson (2009)/Chen (2007) estimator with λ_1 = λ_1^* = λ_2 = 0. So the results are again basically the same as ours for these two parameters. However, the estimates of the sieve parameters in the unpenalized sieve GMM case are \hat\zeta_{sg,0} = 0, \hat\zeta_{sg,j} = 0.022, 0.025, 0.019 for j = 1, ⋯, 3, and \hat\tau_{sg,j} = 0.051, 0.055, 0.056 and \hat\kappa_{sg,j} = 0.076, 0.075, 0.075, respectively, for j = 1, ⋯, 3. So the habit function in the adaptive elastic net GMM with K_T = 3 is estimated as zero (on the boundary), but with the sieve GMM the estimate of the habit function is positive. In the Chen and Ludvigson (2009) article, with a larger instrument set than we use, they find the time discount estimate to be 0.99, the curvature to be 0.76, and a positive habit function at K_T = 3. Note that we use time series data, and, as we suggest after our theorems, this is plausible given our technique and the structure of the proofs.

We now discuss how our assumptions fit this application. First, all of our parameters are uniformly bounded in this application, which is discussed on p. 1067 of Chen and Ludvigson (2009). The second issue is whether the uniform convergence of the partial derivatives is plausible. This is satisfied in the iid case through Condition 3.5M in Chen (2007), which amounts to uniformly bounded partial derivatives, Lipschitz continuity of the partial derivatives, and the log of the covering numbers growing at a rate slower than T.

Assumption 2 is related to the convergence of the weight matrix, which is not restrictive; it implies a relation between q and T, so q cannot grow too fast. In our case q = NJ_T, where N is fixed, so this restricts the growth of J_T. Assumption 3(i) is similar to Assumption 2. For Assumption 3(ii), we have p = 3(K_T + 1); since q = NJ_T ≥ 3(K_T + 1) = p and NJ_T/T → 0, the assumption is satisfied. Assumption 3(iii) is satisfied here by imposing a value between 0 and 1 (including one) for the exponent in the elastic net based weights. Assumption 4 concerns the sample moment functions; since all of our variables are stationary, in terms of ratios, and bounded variables such as returns, we do not expect their 2 + l moments to grow faster than T^{l/2}. Assumption 5 concerns the penalty tuning parameters, and it is satisfied with T in place of n.

6 Conclusion

In this paper we analyze the adaptive elastic net GMM estimator. It can simultaneously select and estimate the model. The new estimator also allows for a diverging number of parameters. The estimator is shown to have the oracle property, so we can estimate the nonzero parameters with the standard GMM limit while the redundant parameters are set to zero by a data-dependent technique. Commonly used AIC and BIC methods, as well as our estimator, face non-uniform consistency issues. Selecting the model with AIC or BIC and then using GMM also suffers from the non-uniform consistency issue, and it does much worse than the adaptive elastic net estimator considered here. The issue with model selection (including AIC and BIC) is that all such procedures are uniformly inconsistent (from the model selection perspective), so coefficients that are arbitrarily close to zero cannot be selected as nonzero. Leeb and Pötscher (2005) establish that, in order to get selected, the order of magnitude of local to zero coefficients should be larger than n^{−1/2}; between 0 and the magnitude n^{−1/2}, model selection is indeterminate. We study the selection issue for local to zero coefficients in the GMM and extend the results of Leeb and Pötscher (2005) to the GMM with a diverging number of parameters.

Acknowledgments

Zhang’s research is supported by National Science Foundation Grants DMS-0654293, 1347844, 1309507, and National Institutes of Health Grant NIH/NC1 R01 CA-08548. We thank the co-editor Jonathan Wright, an associate editor and two anonymous referees for their comments which have substantially improved the paper. Mehmet Caner also thanks Andres Bredahl Kock for advice on consistency proof.

Appendix.

Proof of Theorem 1(i)

Huang, Horowitz, and Ma (2008) analyzed least squares with a Bridge penalty and a diverging number of parameters. Here we extend that analysis to a diverging number of moment conditions and parameters with the adaptive elastic net penalty in nonlinear GMM.

Starting with the definition of \hat\beta_{enet},

(1+\lambda_2/n) S_n(\hat\beta_{enet}) \le (1+\lambda_2/n) S_n(\beta_0),

which is

\Big[\sum_{i=1}^n g_i(\hat\beta_{enet})\Big]'W_n\Big[\sum_{i=1}^n g_i(\hat\beta_{enet})\Big] + \lambda_1\sum_{j=1}^p|\hat\beta_{j,enet}| + \lambda_2\|\hat\beta_{enet}\|_2^2 \le \Big[\sum_{i=1}^n g_i(\beta_0)\Big]'W_n\Big[\sum_{i=1}^n g_i(\beta_0)\Big] + \lambda_1\sum_{j=1}^p|\beta_{j0}| + \lambda_2\|\beta_0\|_2^2.   (15)

Then, setting \iota_n = \lambda_1\sum_{j=1}^p|\beta_{j0}| + \lambda_2\sum_{j=1}^p\beta_{j0}^2 = \lambda_1\sum_{j\in A}|\beta_{j0}| + \lambda_2\sum_{j\in A}\beta_{j0}^2, by (15)

\iota_n \ge \Big[\sum_{i=1}^n g_i(\hat\beta_{enet})\Big]'W_n\Big[\sum_{i=1}^n g_i(\hat\beta_{enet})\Big] - \Big[\sum_{i=1}^n g_i(\beta_0)\Big]'W_n\Big[\sum_{i=1}^n g_i(\beta_0)\Big]   (16)
= -2\big[\hat G_n(\bar\beta)(\beta_0-\hat\beta_{enet})\big]'W_n\Big[\sum_{i=1}^n g_i(\beta_0)\Big] + \big[\hat G_n(\bar\beta)(\beta_0-\hat\beta_{enet})\big]'W_n\big[\hat G_n(\bar\beta)(\beta_0-\hat\beta_{enet})\big],   (17)

via the mean value theorem, with \bar\beta \in (\beta_0, \hat\beta_{enet}). We now simplify (17). To this end, set \Delta_n = n W_n^{1/2}\big[\sum_{i=1}^n G_i(\bar\beta)/n\big](\beta_0-\hat\beta_{enet}) and D_n = n^{1/2} W_n^{1/2}\big[\sum_{i=1}^n g_i(\beta_0)/n^{1/2}\big], where G_i(\beta) = \partial g_i(\beta)/\partial\beta' is a q × p matrix. The next equations, (18)-(20), follow p. 603 of Huang, Horowitz, and Ma (2008); we use them to illustrate our point. It is then clear that (17) can be written as

\Delta_n'\Delta_n - 2D_n'\Delta_n - \iota_n \le 0,   (18)

which provides us

\|\Delta_n - D_n\|_2^2 - \|D_n\|_2^2 - \iota_n \le 0.

Using the last inequality, we deduce

\|\Delta_n - D_n\|_2 \le \|D_n\|_2 + \iota_n^{1/2},

and by the triangle inequality

\|\Delta_n\|_2 \le \|\Delta_n - D_n\|_2 + \|D_n\|_2 \le 2\|D_n\|_2 + \iota_n^{1/2}.   (19)

By using (18)-(19) with simple algebra, we have

\|\Delta_n\|_2^2 \le 6\|D_n\|_2^2 + 3\iota_n.   (20)

Note that

n E\Big(\frac{\sum_{i=1}^n g_i(\beta_0)}{n^{1/2}}\Big)'W_n\Big(\frac{\sum_{i=1}^n g_i(\beta_0)}{n^{1/2}}\Big) - n E\Big(\frac{\sum_{i=1}^n g_i(\beta_0)}{n^{1/2}}\Big)'\Omega^{-1}\Big(\frac{\sum_{i=1}^n g_i(\beta_0)}{n^{1/2}}\Big) = o(1),   (21)

with probability approaching one, by substituting W = \Omega^{-1} and by Assumptions 2-3. Then, by the definition of D_n and (21),

E\|D_n\|_2^2 - n E\Big(\frac{\sum_{i=1}^n g_i(\beta_0)}{n^{1/2}}\Big)'\Omega^{-1}\Big(\frac{\sum_{i=1}^n g_i(\beta_0)}{n^{1/2}}\Big) = o(1).

Next,

E\Big(\frac{\sum_{i=1}^n g_i(\beta_0)}{n^{1/2}}\Big)'\Omega^{-1}\Big(\frac{\sum_{i=1}^n g_i(\beta_0)}{n^{1/2}}\Big) = \mathrm{tr}\Big(E\Big[\Big(\frac{\sum_{i=1}^n g_i(\beta_0)}{n^{1/2}}\Big)\Big(\frac{\sum_{i=1}^n g_i(\beta_0)}{n^{1/2}}\Big)'\Big]\Omega^{-1}\Big) = \mathrm{tr}(I_q) = q.   (22)

Using (22) in (21) with the definition of D_n, with probability approaching one,

E\|D_n\|_2^2 = O(nq).   (23)

Next, by the definition of \|\Delta_n\| and Assumption 2, we have

\|\Delta_n\|_2^2 = n^2(\beta_0-\hat\beta_{enet})'\Big[\Big(\frac{\sum_{i=1}^n G_i(\bar\beta)}{n}\Big)'\Omega^{-1}\Big(\frac{\sum_{i=1}^n G_i(\bar\beta)}{n}\Big)\Big](\beta_0-\hat\beta_{enet}) \ge n^2\,\mathrm{Eigmin}\Big[\Big(\frac{\sum_{i=1}^n G_i(\bar\beta)}{n}\Big)'\Omega^{-1}\Big(\frac{\sum_{i=1}^n G_i(\bar\beta)}{n}\Big)\Big]\|\hat\beta_{enet}-\beta_0\|_2^2 \ge n^2 b\,\|\hat\beta_{enet}-\beta_0\|_2^2,   (24)

with probability approaching one, using (6) and remembering that the elastic net is a subcase of \hat\beta_w. Next, use (23)-(24) in (20) to obtain

n^2 b\, E\|\hat\beta_{enet}-\beta_0\|_2^2 \le O(nq) + O(\lambda_1 p_A + \lambda_2 p_A),   (25)

since \lambda_1\sum_{j\in A}|\beta_{j0}| + \lambda_2\sum_{j\in A}\beta_{j0}^2 = O(\lambda_1 p_A + \lambda_2 p_A), given that the nonzero parameters are at most bounded constants. So by Assumptions 3 and 5, since \lambda_1/n \to 0, \lambda_2/n \to 0, p_A/n \to 0, and q/n \to 0, we have

\|\hat\beta_{enet}-\beta_0\|_2^2 \to_p 0.

Q.E.D.

Proof of Theorem 1(ii)

The proof of the consistency of the estimator \hat\beta_w is the same as in the elastic net case in Theorem 1(i). The only difference is the definition of \iota_n = \lambda_1^*\sum_{j=1}^p\hat w_j|\beta_{j0}| + \lambda_2\sum_{j=1}^p\beta_{j0}^2. Note that the terms with \beta_{j0} = 0 vanish, so we can write \iota_n = \lambda_1^*\sum_{j\in A}\hat w_j|\beta_{j0}| + \lambda_2\sum_{j\in A}\beta_{j0}^2. In other words, only the weights of the nonzero coefficients play a role in the term \iota_n.

As in (25),

n^2 b\, E\|\hat\beta_w-\beta_0\|_2^2 = O(nq) + O\Big(\lambda_1^*\sum_{j\in A}\hat w_j|\beta_{j0}| + \lambda_2\sum_{j\in A}\beta_{j0}^2\Big).   (26)

The key issue is the order of the penalty terms. We are only interested in the weights of the nonzero parameters. Defining \hat\eta = \min_{j\in A}|\hat\beta_{j,enet}|, we have, for j \in A,

\hat w_j = \frac{1}{|\hat\beta_{j,enet}|^{\gamma}} \le \frac{1}{\hat\eta^{\gamma}}.

Then

\frac{\lambda_1^*\sum_{j\in A}\hat w_j|\beta_{j0}|}{n^2} \le \frac{\lambda_1^*\hat\eta^{-\gamma}p_A C}{n^2},   (27)

where we use \max_{j\in A}|\beta_{j0}| \le C, with C a generic positive constant. So we can write (27) as

\frac{\lambda_1^*\hat\eta^{-\gamma}p_A C}{n^2} = \Big[\frac{\lambda_1^*\eta^{-\gamma}p_A C}{n^2}\Big]\Big(\frac{\hat\eta}{\eta}\Big)^{-\gamma}.   (28)

In the above equation we first show that

E\Big(\frac{\hat\eta}{\eta}\Big)^2 = O(1).   (29)

First see that, by a simple algebraic inequality as in (6.13) of Zou and Zhang (2009),

E\Big(\frac{\hat\eta}{\eta}\Big)^2 \le 2 + \frac{2}{\eta^2}E(\hat\eta-\eta)^2 \le 2 + \frac{2}{\eta^2}E\|\hat\beta_{enet}-\beta_0\|_2^2.   (30)

Next, by Theorem 1(i) (equation (25)),

\frac{1}{\eta^2}E\|\hat\beta_{enet}-\beta_0\|_2^2 = O\Big(\frac{nq + \lambda_1 p_A + \lambda_2 p_A}{n^2\eta^2}\Big);

since \lambda_1/n \to 0, \lambda_2/n \to 0 and p_A \le q, the largest stochastic order is nq/(n^2\eta^2) = q/(n\eta^2). With q = n^{\nu}, 0 < \nu < 1, and n^{1-\nu}\eta^2 \to \infty by Assumption 5(iii), clearly

\frac{1}{\eta^2}E\|\hat\beta_{enet}-\beta_0\|_2^2 = o(1).   (31)

Next, by (30)-(31) we have

E\Big(\frac{\hat\eta}{\eta}\Big)^2 \le 2 + o(1).

So (29) is shown, and by Markov's inequality

\Big(\frac{\hat\eta}{\eta}\Big)^2 = O_p(1);

moreover, since |\hat\eta - \eta| \le \|\hat\beta_{enet}-\beta_0\|_2, (31) also implies \hat\eta/\eta \to_p 1, so clearly (\hat\eta/\eta)^{-2} = O_p(1). Note that

\Big(\frac{\hat\eta}{\eta}\Big)^{-\gamma} = \Big[\Big(\frac{\hat\eta}{\eta}\Big)^{-2}\Big]^{\gamma/2}.   (32)

Then by (32) we have

\Big(\frac{\hat\eta}{\eta}\Big)^{-\gamma} = O_p(1).

Next, for the first factor on the right-hand side of (28),

\frac{\lambda_1^*}{n}\cdot\frac{p_A C}{n\eta^{\gamma}} \to 0,

since \lambda_1^*/n \to 0 by Assumption 5(i) and n^{1-\alpha}\eta^{\gamma} \to \infty by Assumption 5(iv). The last two results show that in (28)

\frac{\lambda_1^*\hat\eta^{-\gamma}p_A C}{n^2} = \Big[\frac{\lambda_1^*\eta^{-\gamma}p_A C}{n^2}\Big]\Big(\frac{\hat\eta}{\eta}\Big)^{-\gamma} \to_p 0.   (33)

Next, in (26) we have

\frac{\lambda_2\sum_{j\in A}\beta_{j0}^2}{n^2} = O\Big(\frac{\lambda_2 p_A}{n^2}\Big) = o(1),   (34)

by \lambda_2/n \to 0 and p_A/n \to 0. Also noting that q/n \to 0 by Assumption 3, combining (33)-(34) in (26) above gives

E\|\hat\beta_w-\beta_0\|_2^2 = O\Big(\frac{nq}{n^2}\Big) + O\Big(\frac{\lambda_1^*\sum_{j\in A}\hat w_j|\beta_{j0}| + \lambda_2\sum_{j\in A}\beta_{j0}^2}{n^2}\Big) = o(1).   (35)

Q.E.D.

Proof of Theorem 2

In this proof we start by analyzing the GMM-Ridge estimator, defined as

\hat\beta_R = \arg\min_{\beta\in B_p}\Big[\sum_{i=1}^n g_i(\beta)\Big]'W_n\Big[\sum_{i=1}^n g_i(\beta)\Big] + \lambda_2\|\beta\|_2^2.

Note that this estimator is similar to the elastic net estimator: if we set \lambda_1 = 0 in the elastic net estimator, we obtain the GMM-Ridge estimator. So, since the elastic net estimator is consistent, GMM-Ridge is consistent as well. Define also the q × p matrix \hat G_n(\hat\beta_R) = \sum_{i=1}^n \partial g_i(\hat\beta_R)/\partial\beta'. Let \bar\beta \in (\beta_0, \hat\beta_R), and note that \hat G_n(\bar\beta) is the value of \hat G_n(\cdot) evaluated at \bar\beta. A mean value expansion around \beta_0 applied to the first order conditions provides, with g_n(\beta_0) = \sum_{i=1}^n g_i(\beta_0),

\hat\beta_R = -\big[\hat G_n(\hat\beta_R)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]^{-1}\big[\hat G_n(\hat\beta_R)'W_n g_n(\beta_0) - \hat G_n(\hat\beta_R)'W_n\hat G_n(\bar\beta)\beta_0\big].   (36)

Also using the mean value theorem with the first order conditions, adding and subtracting \lambda_2\beta_0 from the first order conditions yields

\hat\beta_R - \beta_0 = -\big[\hat G_n(\hat\beta_R)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]^{-1}\big[\hat G_n(\hat\beta_R)'W_n g_n(\beta_0) + \lambda_2\beta_0\big].   (37)

We need the following expressions; using (36),

\hat\beta_R'\big[\hat G_n(\hat\beta_R)'W_n g_n(\beta_0) - \hat G_n(\hat\beta_R)'W_n\hat G_n(\bar\beta)\beta_0\big] = -\big[\hat G_n(\hat\beta_R)'W_n g_n(\beta_0) - \hat G_n(\hat\beta_R)'W_n\hat G_n(\bar\beta)\beta_0\big]'\big[\hat G_n(\hat\beta_R)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]^{-1}\big[\hat G_n(\hat\beta_R)'W_n g_n(\beta_0) - \hat G_n(\hat\beta_R)'W_n\hat G_n(\bar\beta)\beta_0\big],   (38)

\hat\beta_R'\big[\hat G_n(\hat\beta_R)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]\hat\beta_R = \big[\hat G_n(\hat\beta_R)'W_n g_n(\beta_0) - \hat G_n(\hat\beta_R)'W_n\hat G_n(\bar\beta)\beta_0\big]'\big[\hat G_n(\hat\beta_R)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]^{-1}\big[\hat G_n(\hat\beta_R)'W_n g_n(\beta_0) - \hat G_n(\hat\beta_R)'W_n\hat G_n(\bar\beta)\beta_0\big].   (39)

Next, the aim is to rewrite the following GMM-Ridge objective function via a mean value expansion:

\Big[\sum_{i=1}^n g_i(\hat\beta_R)\Big]'W_n\Big[\sum_{i=1}^n g_i(\hat\beta_R)\Big] + \lambda_2\|\hat\beta_R\|_2^2 = g_n(\beta_0)'W_n g_n(\beta_0) + g_n(\beta_0)'W_n\hat G_n(\bar\beta)(\hat\beta_R-\beta_0) + (\hat\beta_R-\beta_0)'\hat G_n(\bar\beta)'W_n g_n(\beta_0) + (\hat\beta_R-\beta_0)'\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)(\hat\beta_R-\beta_0) + \lambda_2\|\hat\beta_R\|_2^2.   (40)

When \lambda_1 = 0, using Theorem 1 we see that \|\hat\beta_R - \beta_0\|_2^2 \to_p 0. Then use Assumption 1 to obtain

\hat\beta_R'\Big[\Big(\frac{\hat G_n(\hat\beta_R)}{n}\Big)'W_n\Big(\frac{\hat G_n(\bar\beta)}{n}\Big)+\lambda_2 I_p\Big]\hat\beta_R = \hat\beta_R'\Big[\Big(\frac{\hat G_n(\bar\beta)}{n}\Big)'W_n\Big(\frac{\hat G_n(\bar\beta)}{n}\Big)+\lambda_2 I_p\Big]\hat\beta_R + \hat\beta_R'\Big[o_P(1)'W_n\Big(\frac{\hat G_n(\bar\beta)}{n}\Big)\Big]\hat\beta_R,   (41)

where the o_P(1) term comes from the uniform law of large numbers. Clearly the stochastic order of the second term is smaller than that of the first one on the right-hand side of (41). By the same argument used to obtain (41), we have

\hat\beta_R'\Big[\Big(\frac{\hat G_n(\hat\beta_R)}{n}\Big)'W_n g_n(\beta_0) - \Big(\frac{\hat G_n(\hat\beta_R)}{n}\Big)'W_n\Big(\frac{\hat G_n(\bar\beta)}{n}\Big)\beta_0\Big] = \hat\beta_R'\Big[\Big(\frac{\hat G_n(\bar\beta)}{n}\Big)'W_n g_n(\beta_0) - \Big(\frac{\hat G_n(\bar\beta)}{n}\Big)'W_n\Big(\frac{\hat G_n(\bar\beta)}{n}\Big)\beta_0\Big] + \hat\beta_R'\Big[o_P(1)'W_n g_n(\beta_0) - o_P(1)'W_n\Big(\frac{\hat G_n(\bar\beta)}{n}\Big)\beta_0\Big].   (42)

Again, the second term's stochastic order is smaller than the first one's in (42).

Furthermore, we can rewrite the right-hand side of (40) as

g_n(\beta_0)'W_n g_n(\beta_0) + \hat\beta_R'\big[\hat G_n(\bar\beta)'W_n g_n(\beta_0) - \hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)\beta_0\big] + \big[\hat G_n(\bar\beta)'W_n g_n(\beta_0) - \hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)\beta_0\big]'\hat\beta_R + \hat\beta_R'\big[\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]\hat\beta_R - g_n(\beta_0)'W_n\hat G_n(\bar\beta)\beta_0 - \beta_0'\hat G_n(\bar\beta)'W_n g_n(\beta_0) + \beta_0'\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)\beta_0
= g_n(\beta_0)'W_n g_n(\beta_0) - \hat\beta_R'\big[\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]\hat\beta_R - g_n(\beta_0)'W_n\hat G_n(\bar\beta)\beta_0 - \beta_0'\hat G_n(\bar\beta)'W_n g_n(\beta_0) + \beta_0'\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)\beta_0 - \hat\beta_R'\,so_1\,\hat\beta_R,   (43)

where so_1 represents the small order terms mentioned in (41)-(42). The equality is obtained through (38)-(39), via Assumption 1 and the consistency of GMM-Ridge, since ridge is the subcase of the elastic net with \lambda_1 = 0 imposed. Note that the right-hand-side expression in (38) is just the negative of the right-hand-side expression in (39). As in (43), for the estimator \hat\beta_w we have the following, where \bar\beta_w \in (\beta_0, \hat\beta_w):

\Big[\sum_{i=1}^n g_i(\hat\beta_w)\Big]'W_n\Big[\sum_{i=1}^n g_i(\hat\beta_w)\Big] + \lambda_2\|\hat\beta_w\|_2^2 = g_n(\beta_0)'W_n g_n(\beta_0) + \hat\beta_w'\big[\hat G_n(\bar\beta_w)'W_n g_n(\beta_0) - \hat G_n(\bar\beta_w)'W_n\hat G_n(\bar\beta_w)\beta_0\big] + \big[\hat G_n(\bar\beta_w)'W_n g_n(\beta_0) - \hat G_n(\bar\beta_w)'W_n\hat G_n(\bar\beta_w)\beta_0\big]'\hat\beta_w + \hat\beta_w'\big[\hat G_n(\bar\beta_w)'W_n\hat G_n(\bar\beta_w)+\lambda_2 I_p\big]\hat\beta_w - g_n(\beta_0)'W_n\hat G_n(\bar\beta_w)\beta_0 - \beta_0'\hat G_n(\bar\beta_w)'W_n g_n(\beta_0) + \beta_0'\hat G_n(\bar\beta_w)'W_n\hat G_n(\bar\beta_w)\beta_0.   (44)

Then see that, by Theorem 1, \|\bar\beta - \bar\beta_w\|_2^2 \to_p 0, and using (36),

\hat\beta_w'\big[\hat G_n(\bar\beta)'W_n g_n(\beta_0) - \hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)\beta_0\big] = \hat\beta_w'\big[\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]\big[\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]^{-1}\big[\hat G_n(\bar\beta)'W_n g_n(\beta_0) - \hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)\beta_0\big] = -\hat\beta_w'\big[\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]\hat\beta_R - \hat\beta_w'\,so_2\,\hat\beta_R.   (45)

The term so_2 above comes from the same type of analysis as the second, small stochastic order terms in (41)-(42). Next, substitute (45) into (44) to obtain

\Big[\sum_{i=1}^n g_i(\hat\beta_w)\Big]'W_n\Big[\sum_{i=1}^n g_i(\hat\beta_w)\Big] + \lambda_2\|\hat\beta_w\|_2^2 = g_n(\beta_0)'W_n g_n(\beta_0) - \hat\beta_w'\big[\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]\hat\beta_R - \hat\beta_R'\big[\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]\hat\beta_w + \hat\beta_w'\big[\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big]\hat\beta_w - g_n(\beta_0)'W_n\hat G_n(\bar\beta)\beta_0 - \beta_0'\hat G_n(\bar\beta)'W_n g_n(\beta_0) + \beta_0'\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)\beta_0 - \hat\beta_w'\,so_2\,\hat\beta_R - \hat\beta_R'\,so_3\,\hat\beta_w + \hat\beta_w'\,so_4\,\hat\beta_w.   (46)

The term so_3 is the transpose of so_2, and the term so_4 comes from the approximation error between \bar\beta and \bar\beta_w. Specifically, note that \hat\beta_w'so_2\hat\beta_R, \hat\beta_R'so_3\hat\beta_w, and \hat\beta_w'so_4\hat\beta_w are of smaller order than the second, third, and fourth terms on the right-hand side of (46), respectively. Denote so_5 = \min(so_1, so_2, so_3, so_4). Now subtract (43) from (46) to obtain

\Big[\sum_{i=1}^n g_i(\hat\beta_w)\Big]'W_n\Big[\sum_{i=1}^n g_i(\hat\beta_w)\Big] + \lambda_2\|\hat\beta_w\|_2^2 - \Big(\Big[\sum_{i=1}^n g_i(\hat\beta_R)\Big]'W_n\Big[\sum_{i=1}^n g_i(\hat\beta_R)\Big] + \lambda_2\|\hat\beta_R\|_2^2\Big) \ge (\hat\beta_w-\hat\beta_R)'\big[\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big](\hat\beta_w-\hat\beta_R) + (\hat\beta_w-\hat\beta_R)'\,so_5\,(\hat\beta_w-\hat\beta_R).   (47)

The next analysis is very similar to equations (6.1)-(6.6) of Zou and Zhang (2009). After this important result, see that by the definitions of \hat\beta_w and \hat\beta_R,

\lambda_1^*\sum_{j=1}^p\hat w_j\big(|\hat\beta_{j,R}| - |\hat\beta_{j,w}|\big) \ge \Big(\sum_{i=1}^n g_i(\hat\beta_w)\Big)'W_n\Big(\sum_{i=1}^n g_i(\hat\beta_w)\Big) + \lambda_2\|\hat\beta_w\|_2^2 - \Big[\Big(\sum_{i=1}^n g_i(\hat\beta_R)\Big)'W_n\Big(\sum_{i=1}^n g_i(\hat\beta_R)\Big) + \lambda_2\|\hat\beta_R\|_2^2\Big].   (48)

Then also see that

\sum_{j=1}^p\hat w_j\big(|\hat\beta_{j,R}| - |\hat\beta_{j,w}|\big) \le \sqrt{\sum_{j=1}^p\hat w_j^2}\;\|\hat\beta_R-\hat\beta_w\|_2.   (49)

Next, use (48) with (47) and (49) to obtain

(\hat\beta_w-\hat\beta_R)'\big[\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big](\hat\beta_w-\hat\beta_R) + (\hat\beta_w-\hat\beta_R)'\,so_5\,(\hat\beta_w-\hat\beta_R) \le \lambda_1^*\sqrt{\sum_{j=1}^p\hat w_j^2}\;\|\hat\beta_R-\hat\beta_w\|_2.   (50)

See that \mathrm{Eigmin}\big(\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)+\lambda_2 I_p\big) = \mathrm{Eigmin}\big(\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)\big)+\lambda_2. Use this in (50) to obtain

\|\hat\beta_w-\hat\beta_R\|_2 \le \frac{\lambda_1^*\sqrt{\sum_{j=1}^p\hat w_j^2}}{\mathrm{Eigmin}\big(\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)\big)+\lambda_2+so_5},   (51)

where \mathrm{Eigmin}\big(\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta)\big) is a term of larger stochastic order than so_5, as explained in (43) and (46). We also want to modify the last inequality. By the consistency of \hat\beta_w and \hat\beta_R, \bar\beta \to_p \beta_0. Then, with the uniform law of large numbers on the partial derivatives, we have by Assumptions 1-3

\Big\|\Big[\frac{\hat G_n(\bar\beta)}{n}\Big]'W_n\Big[\frac{\hat G_n(\bar\beta)}{n}\Big] - G(\beta_0)'WG(\beta_0)\Big\|_2^2 \to_p 0.

The last equation is also true with \hat\beta_w or \hat\beta_R replacing \bar\beta. Then

\big\|\hat G_n(\bar\beta)'W_n\hat G_n(\bar\beta) - n^2\big[G(\beta_0)'WG(\beta_0)\big] - o(n^2)\big\|_2^2 \to_p 0.   (52)

Using Lemma A0 of Newey and Windmeijer (2009b), modify (51) in the following way given the last equality, setting W = \Omega^{-1} (since this is the efficient limit weight, as shown by Hansen (1982)):

\|\hat\beta_w-\hat\beta_R\|_2 \le \frac{\lambda_1^*\sqrt{\sum_{j=1}^p\hat w_j^2}}{n^2\,\mathrm{Eigmin}\big(G(\beta_0)'\Omega^{-1}G(\beta_0)\big)+\lambda_2+o_P(n^2)}.   (53)

Now we consider the second part of the proof of this theorem. We use the GMM ridge formula. Note that from (37)

$$\hat{\beta}_R-\beta_0=-\lambda_2\left[\hat{G}_n(\hat{\beta}_R)'W_n\hat{G}_n(\beta^*)+\lambda_2 I_p\right]^{-1}\beta_0-\left[\hat{G}_n(\hat{\beta}_R)'W_n\hat{G}_n(\beta^*)+\lambda_2 I_p\right]^{-1}\left[\hat{G}_n(\hat{\beta}_R)'W_n g_n(\beta_0)\right]. \quad (54)$$
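To fix ideas about the ridge representation in (54), the following is a minimal sketch of a ridge-penalized GMM estimator (the $\lambda_1=0$ case of the elastic net, as in (72)) under the assumption of a linear instrumental-variables model, where the moments are linear in $\beta$ and the first-order condition has an explicit solution. The data-generating process, the variable names, and the tuning value are illustrative assumptions only, not the paper's design.

```python
import numpy as np

# Minimal sketch: ridge-penalized GMM (lambda_1 = 0 case of the elastic net)
# for a linear IV model y = X beta + e with instruments Z, so that
# g_i(beta) = z_i (y_i - x_i' beta).  The linear first-order condition solved
# below is the explicit analogue of the ridge representation used in (54);
# the DGP is purely illustrative.
rng = np.random.default_rng(1)
n, p, q = 500, 3, 5
Z = rng.standard_normal((n, q))                     # instruments
X = Z @ rng.standard_normal((q, p)) + 0.5 * rng.standard_normal((n, p))
beta0 = np.array([1.0, 0.0, -2.0])
y = X @ beta0 + rng.standard_normal(n)

W = np.linalg.inv(Z.T @ Z / n)                      # first-step weight matrix
lam2 = 1.0
A = X.T @ Z @ W @ Z.T @ X + lam2 * np.eye(p)        # G_hat' W G_hat + lam2 I
b = X.T @ Z @ W @ Z.T @ y
beta_ridge = np.linalg.solve(A, b)                  # ridge-GMM estimate
print(beta_ridge)
```

In the linear case the solved system is exactly the first-order condition behind (54); in the nonlinear setting of the paper the same representation holds only through the mean-value expansion used above.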

We now modify (54) slightly.

In the same way that we obtained (52),

$$\left\|\hat{G}_n(\hat{\beta}_R)'W_n g_n(\beta_0)-n\left[G(\beta_0)'Wg_n(\beta_0)\right]-o_P(n)\right\|_2^2\xrightarrow{P}0. \quad (55)$$

Second, see that, since the $g_i(\beta)$ are independent, $E[g_n(\beta_0)g_n(\beta_0)']-n\Omega\to 0$.

$$E\left[g_n(\beta_0)'\Omega^{-1}G(\beta_0)G(\beta_0)'\Omega^{-1}g_n(\beta_0)\right]=\mathrm{tr}\left\{G(\beta_0)'\Omega^{-1}E\left[g_n(\beta_0)g_n(\beta_0)'\right]\Omega^{-1}G(\beta_0)\right\}=n\,\mathrm{tr}\left\{G(\beta_0)'\Omega^{-1}G(\beta_0)\right\}+o(1)\le np\,\mathrm{Eigmax}(\Sigma)+o(1), \quad (56)$$

where we use $\Sigma=G(\beta_0)'\Omega^{-1}G(\beta_0)$. Now we modify (54) using (55) and (52), with $W=\Omega^{-1}$:

$$\hat{\beta}_R-\beta_0=-\lambda_2\left[n^2 G(\beta_0)'\Omega^{-1}G(\beta_0)+\lambda_2 I_p+o_P(n^2)\right]^{-1}\beta_0-\left[n^2 G(\beta_0)'\Omega^{-1}G(\beta_0)+\lambda_2 I_p+o_P(n^2)\right]^{-1}\left[nG(\beta_0)'\Omega^{-1}g_n(\beta_0)+o_P(n)\right]. \quad (57)$$

Then see that

$$\begin{aligned}
E(\|\hat{\beta}_R-\beta_0\|_2^2)&\le 2\lambda_2^2\left[n^2\,\mathrm{Eigmin}(G(\beta_0)'\Omega^{-1}G(\beta_0))+\lambda_2+o(n^2)\right]^{-2}\|\beta_0\|_2^2\\
&\quad+2\left[n^2\,\mathrm{Eigmin}(G(\beta_0)'\Omega^{-1}G(\beta_0))+\lambda_2+o(n^2)\right]^{-2}\times\left\{n^2 E\left[g_n(\beta_0)'\Omega^{-1}G(\beta_0)G(\beta_0)'\Omega^{-1}g_n(\beta_0)\right]+o(n^2)\right\}\\
&\le 2\left[n^2\,\mathrm{Eigmin}(G(\beta_0)'\Omega^{-1}G(\beta_0))+\lambda_2+o(n^2)\right]^{-2}\times\left[\lambda_2^2\|\beta_0\|_2^2+n^3 p\,\mathrm{Eigmax}(\Sigma)+o(n^2)\right], \qquad (58)
\end{aligned}$$

where the last inequality is by (56). Note that we do not carry orders smaller than $o(n^2)$ in (58), since they make no difference in the proofs of the theorems below. Now use (53) and (58) to obtain

$$E(\|\hat{\beta}_w-\beta_0\|_2^2)\le 2E(\|\hat{\beta}_R-\beta_0\|_2^2)+2E(\|\hat{\beta}_w-\hat{\beta}_R\|_2^2)\le\frac{4\lambda_2^2\|\beta_0\|_2^2+4n^3 pB+o(n^2)+2\lambda_1^2 E\sum_{j=1}^p\hat{w}_j^2}{\left[n^2 b+\lambda_2+o(n^2)\right]^2}. \quad (59)$$

See that $b\le\mathrm{Eigmin}(G(\beta_0)'\Omega^{-1}G(\beta_0))=\mathrm{Eigmin}(\Sigma)$ and $B\ge\mathrm{Eigmax}(\Sigma)$.

Q.E.D.

Proof of Theorem 3(i)

The proof is similar to the proof of Theorem 3.2 in Zou and Zhang (2009). The differences are due to the nonlinear nature of the problem. Our upper bounds in Theorem 2 converge to zero at a different rate than those of Zou and Zhang (2009). To prove the theorem, we need to show the following (note the Kuhn-Tucker conditions of (1)):

$$P\left[\forall j\in A^c,\ \left|2\hat{G}_{n,j}(\tilde{\beta})'W_n\left(\sum_{i=1}^n g_i(\tilde{\beta})\right)\right|\le\lambda_1\hat{w}_j\right]\to 1,$$

where $\hat{G}_n(\tilde{\beta})=\partial\sum_{i=1}^n g_i(\tilde{\beta})/\partial\beta'$, and $A^c=\{j:\beta_{j0}=0,\ j=1,2,\cdots,p\}$. $\hat{G}_{n,j}(\tilde{\beta})$ denotes the $j$th column of the partial derivative matrix, which corresponds to the irrelevant parameters, evaluated at $\tilde{\beta}$. Equivalently, we need to show

$$P\left[\exists j\in A^c,\ \left|2\hat{G}_{n,j}(\tilde{\beta})'W_n\left(\sum_{i=1}^n g_i(\tilde{\beta})\right)\right|>\lambda_1\hat{w}_j\right]\to 0.$$

Now set $\eta=\min_{j\in A}|\beta_{j0}|$, $\hat{\eta}=\min_{j\in A}|\hat{\beta}_{j,enet}|$, and $A=\{j:\beta_{j0}\ne 0,\ j=1,2,\cdots,p\}$. So

$$P\left[\exists j\in A^c,\ \left|2\hat{G}_{n,j}(\tilde{\beta})'W_n\left(\sum_{i=1}^n g_i(\tilde{\beta})\right)\right|>\lambda_1\hat{w}_j\right]\le\sum_{j\in A^c}P\left[\left|2\hat{G}_{n,j}(\tilde{\beta})'W_n\left(\sum_{i=1}^n g_i(\tilde{\beta})\right)\right|>\lambda_1\hat{w}_j,\ \hat{\eta}>\frac{\eta}{2}\right]+P\left[\hat{\eta}\le\frac{\eta}{2}\right]. \quad (60)$$

Then as in equation (6.7) of Zou and Zhang (2009), we can show that

$$P\left[\hat{\eta}\le\frac{\eta}{2}\right]\le\frac{E\|\hat{\beta}_{enet}-\beta_0\|_2^2}{\eta^2/4}\le\frac{16\left[\lambda_2^2\|\beta_0\|_2^2+n^3 pB+\lambda_1^2 p+o(n^2)\right]}{\left[n^2 b+\lambda_2+o(n^2)\right]^2\eta^2}, \quad (61)$$

where the first inequality holds because $\hat{\eta}\le\eta/2$ forces $|\hat{\beta}_{j,enet}-\beta_{j0}|\ge\eta/2$ for some $j\in A$, so that $\|\hat{\beta}_{enet}-\beta_0\|_2^2\ge\eta^2/4$, and Markov's inequality then applies; the second inequality is due to Theorem 2.
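The set-inclusion-plus-Markov step behind the first inequality in (61) can be illustrated numerically. The toy Monte Carlo below uses a hypothetical Gaussian "estimator" centered at a sparse $\beta_0$; every choice in the code is an illustrative assumption and not the estimator studied in the paper.

```python
import numpy as np

# Toy check of the step behind (61): if eta_hat <= eta/2, then
# ||beta_hat - beta_0||_2 >= eta/2, so Markov's inequality bounds the
# probability of that event by E||beta_hat - beta_0||_2^2 / (eta/2)^2.
# The Gaussian "estimator" below is purely illustrative.
rng = np.random.default_rng(3)
beta0 = np.array([0.8, -0.5, 0.0, 0.0])          # A = {1, 2}, eta = 0.5
A = beta0 != 0
eta = np.min(np.abs(beta0[A]))
reps, sigma = 20000, 0.15

beta_hat = beta0 + sigma * rng.standard_normal((reps, beta0.size))
eta_hat = np.abs(beta_hat[:, A]).min(axis=1)
sq_err = ((beta_hat - beta0) ** 2).sum(axis=1)

lhs = np.mean(eta_hat <= eta / 2)                # P[eta_hat <= eta/2]
rhs = sq_err.mean() / (eta / 2) ** 2             # Markov-type bound
print(lhs, rhs)                                  # lhs does not exceed rhs
```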

Then we can also have

$$\begin{aligned}
\sum_{j\in A^c}P\left[\left|2\hat{G}_{n,j}(\tilde{\beta})'W_n\left(\sum_{i=1}^n g_i(\tilde{\beta})\right)\right|>\lambda_1\hat{w}_j,\ \hat{\eta}>\frac{\eta}{2}\right]
&\le\sum_{j\in A^c}P\left[\left|2\hat{G}_{n,j}(\tilde{\beta})'W_n\left(\sum_{i=1}^n g_i(\tilde{\beta})\right)\right|>\lambda_1\hat{w}_j,\ \hat{\eta}>\frac{\eta}{2},\ |\hat{\beta}_{j,enet}|\le M\right]+\sum_{j\in A^c}P\left[|\hat{\beta}_{j,enet}|>M\right]\\
&\le\sum_{j\in A^c}P\left[\left|2\hat{G}_{n,j}(\tilde{\beta})'W_n\left(\sum_{i=1}^n g_i(\tilde{\beta})\right)\right|>\lambda_1 M^{-\gamma},\ \hat{\eta}>\frac{\eta}{2}\right]+\sum_{j\in A^c}P\left[|\hat{\beta}_{j,enet}|>M\right], \qquad (62)
\end{aligned}$$

where $M=(\lambda_1/n^{3+\alpha})^{1/(2\gamma)}$. Compared to Zou and Zhang (2009), $M$ converges to zero faster.

In (62) we consider the second term on the right hand side. Via inequality (6.8) of Zou and Zhang (2009) and Theorem 2, we have

$$\sum_{j\in A^c}P\left[|\hat{\beta}_{j,enet}|>M\right]\le\frac{E\|\hat{\beta}_{enet}-\beta_0\|_2^2}{M^2}\le\frac{4\lambda_2^2\|\beta_0\|_2^2+4n^3 pB+\lambda_1^2 p+o(n^2)}{\left[n^2 b+\lambda_2+o(n^2)\right]^2 M^2}. \quad (63)$$

Next we can consider the first term on the right hand side of (62)

$$\sum_{j\in A^c}P\left[\left|2\hat{G}_{n,j}(\tilde{\beta})'W_n\left(\sum_{i=1}^n g_i(\tilde{\beta})\right)\right|>\lambda_1 M^{-\gamma},\ \hat{\eta}>\frac{\eta}{2}\right]\le\frac{4M^{2\gamma}}{\lambda_1^2}E\left[\sum_{j\in A^c}\left|\hat{G}_{n,j}(\tilde{\beta})'W_n\left(\sum_{i=1}^n g_i(\tilde{\beta})\right)\right|^2 I\left\{\hat{\eta}>\frac{\eta}{2}\right\}\right]. \quad (64)$$

So we try to simplify the term on the right hand side of (64). Now we evaluate

$$\sum_{j\in A^c}\left|\hat{G}_{n,j}(\tilde{\beta})'W_n\left(\sum_{i=1}^n g_i(\tilde{\beta})\right)\right|^2\le 2\sum_{j\in A^c}\left|\hat{G}_{n,j}(\tilde{\beta})'W_n\left(\sum_{i=1}^n g_i(\beta_{A,0})\right)\right|^2+2\sum_{j\in A^c}\left|\hat{G}_{n,j}(\tilde{\beta})'W_n\hat{G}_n(\beta^*)(\tilde{\beta}-\beta_{A,0})\right|^2, \quad (65)$$

where $\beta^*\in(\beta_{A,0},\tilde{\beta})$, and

$$g_i(\tilde{\beta})=g_i(\beta_{A,0})+\left[\frac{\partial g_i(\beta^*)}{\partial\beta'}\right](\tilde{\beta}-\beta_{A,0}).$$

Analyze each term in (65). Note that $\tilde{\beta}$ is consistent if we go through the same steps as in Theorem 1, by Assumption 5 on $\lambda_1$. Then, applying Assumption 1 with Assumption 2 (uniform law of large numbers) in the same way as in equation (6.9) of Zou and Zhang (2009), we have

$$2\sum_{j\in A^c}\left|\hat{G}_{n,j}(\tilde{\beta})'W_n\left(\sum_{i=1}^n g_i(\beta_{A,0})\right)\right|^2\le 2n^2\left\|G(\beta_{A,0})'\Omega^{-1}\sum_{i=1}^n g_i(\beta_{A,0})\right\|_2^2+o_P(n^2), \quad (66)$$

where we put $W=\Omega^{-1}$ and use Assumption 3(i), $\|[n^{-1}\sum_{i=1}^n E g_i(\beta_{A,0})g_i(\beta_{A,0})']^{-1}-\Omega^{-1}\|_2^2\to 0$, as the definition of the efficient limit weight matrix for the case of nonzero parameters, to obtain

$$E\left[\sum_{j\in A^c}\left|\hat{G}_{n,j}(\tilde{\beta})'W_n\left(\sum_{i=1}^n g_i(\beta_{A,0})\right)\right|^2\right]\le n^3 B+o(n^3), \quad (67)$$

where we use

$$B\ge\mathrm{Eigmax}(\Sigma)\ge\mathrm{Eigmax}(G(\beta_{A,0})'\Omega^{-1}G(\beta_{A,0})). \quad (68)$$

See that $\sum_{i=1}^n g_i(\beta_{A,0})=O_P(n^{1/2})$.

In the same manner as in equation (6.9) of Zou and Zhang (2009), we have

$$\sum_{j\in A^c}\left|\hat{G}_{n,j}(\tilde{\beta})'W_n\hat{G}_n(\beta^*)(\tilde{\beta}-\beta_{A,0})\right|^2\le n^4\left\|G(\beta_{A,0})'\Omega^{-1}G(\beta_{A,0})(\tilde{\beta}-\beta_{A,0})\right\|^2+o_P(n^4). \quad (69)$$

Then by (68) and taking into account (69), we have

$$\sum_{j\in A^c}\left|\hat{G}_{n,j}(\tilde{\beta})'W_n\hat{G}_n(\beta^*)(\tilde{\beta}-\beta_{A,0})\right|^2\le n^4 B^2\|\tilde{\beta}-\beta_{A,0}\|_2^2+o_P(n^4). \quad (70)$$

Now, substituting (67)-(70) into the term on the right-hand side of (64), we get

$$E\left[\sum_{j\in A^c}\left|\hat{G}_{n,j}(\tilde{\beta})'W_n\left(\sum_{i=1}^n g_i(\tilde{\beta})\right)\right|^2 I\left\{\hat{\eta}>\frac{\eta}{2}\right\}\right]\le 2B^2 n^4 E\left(\|\tilde{\beta}-\beta_{A,0}\|_2^2 I\left\{\hat{\eta}>\frac{\eta}{2}\right\}\right)+2Bn^3+o(n^4). \quad (71)$$

Define the ridge-based version of $\tilde{\beta}$ by imposing $\lambda_1=0$:

$$\tilde{\beta}(\lambda_2,0)=\arg\min_\beta\left\{\left(\sum_{i=1}^n g_i(\beta_A)\right)'W_n\left(\sum_{i=1}^n g_i(\beta_A)\right)+\lambda_2\sum_{j\in A}\beta_j^2\right\}. \quad (72)$$

Then, using the arguments in the proof of Theorem 2 (equation (53)), we have

$$\|\tilde{\beta}-\tilde{\beta}(\lambda_2,0)\|_2\le\frac{\lambda_1\max_{j\in A}\hat{w}_j\sqrt{p}}{n^2\,\mathrm{Eigmin}(G(\beta_{A,0})'\Omega^{-1}G(\beta_{A,0}))+\lambda_2+o_P(n^2)}\le\frac{\lambda_1\hat{\eta}^{-\gamma}\sqrt{p}}{n^2 b+\lambda_2+o_P(n^2)}, \quad (73)$$

where $\mathrm{Eigmin}(G(\beta_{A,0})'\Omega^{-1}G(\beta_{A,0}))\ge\mathrm{Eigmin}(G(\beta_0)'\Omega^{-1}G(\beta_0))\ge b$.

Then, following the proof of Theorem 2 (i.e., equation (59)), for the right-hand-side term in (71),

$$E\left(\|\tilde{\beta}-\beta_{A,0}\|_2^2 I\left\{\hat{\eta}>\frac{\eta}{2}\right\}\right)\le\frac{4\left[\lambda_2^2\|\beta_0\|_2^2+n^3 pB+\lambda_1^2(\eta/2)^{-2\gamma}p+o(n^2)\right]}{\left(bn^2+\lambda_2+o(n^2)\right)^2}. \quad (74)$$

Now combine (61), (62), (63), (64), (71), and (74) into (60):

$$\begin{aligned}
P\left[\exists j\in A^c,\ \left|2\hat{G}_{n,j}(\tilde{\beta})'W_n\left(\sum_{i=1}^n g_i(\tilde{\beta})\right)\right|>\lambda_1\hat{w}_j\right]
&\le\frac{16}{\eta^2}\left[\frac{\lambda_2^2\|\beta_0\|_2^2+Bn^3 p+\lambda_1^2 p+o(n^2)}{\left[n^2 b+\lambda_2+o(n^2)\right]^2}\right]+\frac{4}{M^2}\left[\frac{\lambda_2^2\|\beta_0\|_2^2+Bn^3 p+\lambda_1^2 p+o(n^2)}{\left[n^2 b+\lambda_2+o(n^2)\right]^2}\right]\\
&\quad+\frac{4M^{2\gamma}}{\lambda_1^2}\left[\left(2n^3 B+o(n^3)\right)+\left(2B^2 n^4+o(n^4)\right)\times\left(\frac{\lambda_2^2\|\beta_0\|_2^2+n^3 pB+\lambda_1^2(\eta/2)^{-2\gamma}p+o(n^2)}{\left(bn^2+\lambda_2+o(n^2)\right)^2}\right)\right]. \qquad (75)
\end{aligned}$$

Now we need to show that each square-bracketed term on the right-hand side of (75) converges in probability to zero. We start with the first square-bracketed term. The orders of its expressions are

$$\frac{p}{n\eta^2}\frac{\lambda_2^2}{n^3}\to 0,\qquad\frac{p}{n\eta^2}\to 0,\qquad\frac{p}{n\eta^2}\frac{\lambda_1^2}{n^3}\to 0,$$

by $\lambda_1^2/n^3\to 0$ and $\lambda_2^2/n^3\to 0$ via Assumption 5, and

$$\frac{p}{n\eta^2}=\frac{1}{n^{1-\alpha}\eta^2}\to 0, \quad (76)$$

by Assumptions 5(iv) and 3, with $n^{1-\nu}\eta^2\to\infty$ and $\alpha\le\nu$. Note that $\|\beta_0\|_2^2=O(p)$ by Assumption 1.

Next we consider the second square-bracketed term on the right-hand side of (75). See that the dominating term in that expression is of stochastic order

$$O\left(\frac{p}{nM^2}\right)\to 0.$$

The above is true since, by Assumption 5, $\lambda_1/n^{3+\alpha}\ge n^{-\gamma(1-\alpha)}$, and $M=[\lambda_1/n^{3+\alpha}]^{1/(2\gamma)}$ since $\gamma>0$:

$$\frac{p}{nM^2}=\left(\frac{\lambda_1}{n^{3+\alpha}}\right)^{-1/\gamma}n^{-(1-\alpha)}.$$

The other terms in the second square-bracketed term on the right-hand side of (75) are

$$\frac{\lambda_2^2}{n^3}\frac{\|\beta_0\|_2^2}{nM^2}\to 0,$$

by $\lambda_2^2/n^3\to 0$, $\|\beta_0\|_2^2=O(p)$, and $p/(nM^2)\to 0$ from the analysis of the dominating term above. Then, in the same way,

$$\frac{\lambda_1^2 p}{M^2 n^4}=\frac{\lambda_1^2}{n^3}\frac{p}{nM^2}\to 0,$$

where we use $\lambda_1^2/n^3\to 0$ and the analysis of the dominating term $p/(nM^2)\to 0$ above.

Now we consider the last square-bracketed term in (75). The orders of its expressions are $M^{2\gamma}n^3/\lambda_1^2$, $M^{2\gamma}\lambda_2^2 p/\lambda_1^2$, $M^{2\gamma}n^3 p/\lambda_1^2$, and $M^{2\gamma}\lambda_1^2 p/(\lambda_1^2\eta^{2\gamma})$, respectively. Clearly the order of the third expression dominates those of the first and second expressions, given $\lambda_2^2/n^3\to 0$. Now the order of the third expression is

$$\frac{M^{2\gamma}n^3 p}{\lambda_1^2}=\frac{1}{\lambda_1}\to 0,$$

given the definition $M=(\lambda_1/n^{3+\alpha})^{1/(2\gamma)}$. For the order of the last expression,

$$\frac{M^{2\gamma}\lambda_1^2 p}{\lambda_1^2\eta^{2\gamma}}=\frac{M^{2\gamma}p}{\eta^{2\gamma}}=\frac{\lambda_1}{n}\frac{1}{n^2\eta^{2\gamma}}\to 0,$$

since $\lambda_1/n\to 0$, $p=n^\alpha$, the definition of $M$, and Assumption 5(iv).

Q.E.D.

Proof of Theorem 3(ii)

The proof technique is very similar to that of Theorem 3.3 in Zou and Zhang (2009), given Theorems 2 and 3(i). Given the result of Theorem 3(i), it suffices to prove

$$P\left(\min_{j\in A}|\tilde{\beta}_j|>0\right)\to 1.$$

Then we can write the following, with $\tilde{\beta}(\lambda_2,0)$ defined in (72):

$$\min_{j\in A}|\tilde{\beta}_j|>\min_{j\in A}|\tilde{\beta}(\lambda_2,0)_j|-\|\tilde{\beta}-\tilde{\beta}(\lambda_2,0)\|_2. \quad (77)$$

Also see that

$$\min_{j\in A}|\tilde{\beta}(\lambda_2,0)_j|>\min_{j\in A}|\beta_{j0}|-\|\tilde{\beta}(\lambda_2,0)-\beta_{A,0}\|_2. \quad (78)$$

Combine (77) and (78) to obtain

$$\min_{j\in A}|\tilde{\beta}_j|>\min_{j\in A}|\beta_{j0}|-\|\tilde{\beta}(\lambda_2,0)-\beta_{A,0}\|_2-\|\tilde{\beta}-\tilde{\beta}(\lambda_2,0)\|_2. \quad (79)$$

Now we consider the last two terms on the right-hand side of (79). Similarly to the derivation of (57)-(58), we have

$$E\|\tilde{\beta}(\lambda_2,0)-\beta_{A,0}\|_2^2\le\frac{4\lambda_2^2\|\beta_{A,0}\|_2^2+4n^3 pB+o(n^2)}{\left[n^2 b+\lambda_2+o(n^2)\right]^2}=O\left(\frac{p}{n}\right)=o(1), \quad (80)$$

by $\lambda_2^2\|\beta_{A,0}\|_2^2/n^4=(\lambda_2^2/n^2)(\|\beta_{A,0}\|_2^2/n^2)\to 0$ and $\|\beta_{A,0}\|_2^2=O(p)$. Then, by (73),

$$\|\tilde{\beta}-\tilde{\beta}(\lambda_2,0)\|_2\le\frac{\lambda_1\hat{\eta}^{-\gamma}\sqrt{p}}{n^2 b+\lambda_2+o_P(n^2)}. \quad (81)$$

See that

$$\frac{\lambda_1\hat{\eta}^{-\gamma}\sqrt{p}}{n^2 b+\lambda_2+o_P(n^2)}=O\left(\left(\frac{\hat{\eta}}{\eta}\right)^{-\gamma}\frac{\lambda_1\sqrt{p}}{n^2\eta^\gamma}\right). \quad (82)$$
$$\frac{\lambda_1\sqrt{p}}{n^2\eta^\gamma}=\frac{1}{p^{1/2}}\frac{\lambda_1}{n}\left(\frac{p}{n\eta^\gamma}\right)=\frac{1}{n^{\alpha/2}}o(1), \quad (83)$$

by (76) (with $p=n^\alpha$) and Assumptions 5(i) and (iv). Then, by Theorem 2,

$$E\left[\left(\frac{\hat{\eta}}{\eta}\right)^2\right]\le 2+\frac{1}{\eta^2}E[\hat{\eta}-\eta]^2\le 2+\frac{2}{\eta^2}E\|\hat{\beta}(\lambda_2,\lambda_1)-\beta_0\|_2^2\le 2+\frac{8}{\eta^2}\left[\frac{\lambda_2^2\|\beta_0\|_2^2+n^3 pB+\lambda_1^2 p+o(n^2)}{\left[n^2 b+\lambda_2+o(n^2)\right]^2}\right]=O(1), \quad (84)$$

by $\lambda_1/n\to 0$, $\lambda_2/n\to 0$, $p/n\to 0$, $\|\beta_0\|_2^2=O(p)$, and (76). Using the same technique as in the proof of Theorem 1(ii), substitute (83) and (84) into (82) to obtain

$$\frac{\lambda_1\hat{\eta}^{-\gamma}\sqrt{p}}{n^2 b+\lambda_2+o_P(n^2)}\xrightarrow{P}0. \quad (85)$$

Now use (80) and (83) with $\eta=\min_{j\in A}|\beta_{j0}|$ to obtain

$$\min_{j\in A}|\tilde{\beta}_j|>\eta-\sqrt{\frac{p}{n}}\,O_P(1)-\frac{1}{n^{\alpha/2}}o_P(1). \quad (86)$$

Then we obtain the desired result, since by Assumptions 5(v) and 5(iv) the last two terms on the right-hand side of (86) converge to zero faster than $\eta$.

Q.E.D.

Proof of Theorem 4

We now prove the limit result. First, define

$$z_n=\delta'\left[\frac{I+\lambda_2\left(\hat{G}(\hat{\beta}_{aenet,A})'W_n\hat{G}(\hat{\beta}_{aenet,A})\right)^{-1}}{1+\lambda_2/n}\right]\left(\hat{G}(\hat{\beta}_{aenet,A})'W_n\hat{G}(\hat{\beta}_{aenet,A})\right)^{1/2}n^{-1/2}(\hat{\beta}_{aenet,A}-\beta_{A,0}).$$

Then we need the following result. Following the proof of Theorem 1 and using Theorem 3,

$$\|\hat{\beta}_{aenet,A}-\beta_{A,0}\|_2^2\xrightarrow{P}0,$$

and by (80)

$$\|\tilde{\beta}(\lambda_2,0)-\beta_{A,0}\|_2^2\xrightarrow{P}0.$$

Next by Assumption 1 and Assumption 2, and considering the above results about consistency, we have

$$\left\|\left(\frac{\hat{G}(\hat{\beta}_{aenet,A})}{n}\right)'W_n\left(\frac{\hat{G}(\hat{\beta}_{aenet,A})}{n}\right)-\left(\frac{\hat{G}(\tilde{\beta}(\lambda_2,0))}{n}\right)'W_n\left(\frac{\hat{G}(\tilde{\beta}(\lambda_2,0))}{n}\right)\right\|_2^2\xrightarrow{P}0. \quad (87)$$

Next, note that, as on p. 18 of Zou and Zhang (2009),

$$\begin{aligned}
&\delta'\left[I+\lambda_2\left[\hat{G}(\tilde{\beta}(\lambda_2,0))'W_n\hat{G}(\tilde{\beta}(\lambda_2,0))\right]^{-1}\right]\times\left[\hat{G}(\tilde{\beta}(\lambda_2,0))'W_n\hat{G}(\tilde{\beta}(\lambda_2,0))\right]^{1/2}n^{-1/2}\left(\tilde{\beta}-\frac{\beta_{A,0}}{1+\lambda_2/n}\right)\\
&=\left[\delta'\left(I+\lambda_2\left[\hat{G}(\tilde{\beta}(\lambda_2,0))'W_n\hat{G}(\tilde{\beta}(\lambda_2,0))\right]^{-1}\right)\left(\hat{G}(\tilde{\beta}(\lambda_2,0))'W_n\hat{G}(\tilde{\beta}(\lambda_2,0))\right)^{1/2}\times n^{-1/2}\frac{\lambda_2\beta_{A,0}}{n+\lambda_2}\right]\\
&\quad+\left[\delta'\left(I+\lambda_2\left[\hat{G}(\tilde{\beta}(\lambda_2,0))'W_n\hat{G}(\tilde{\beta}(\lambda_2,0))\right]^{-1}\right)\left(\hat{G}(\tilde{\beta}(\lambda_2,0))'W_n\hat{G}(\tilde{\beta}(\lambda_2,0))\right)^{1/2}\times n^{-1/2}(\tilde{\beta}-\tilde{\beta}(\lambda_2,0))\right]\\
&\quad+\left[\delta'\left(I+\lambda_2\left[\hat{G}(\tilde{\beta}(\lambda_2,0))'W_n\hat{G}(\tilde{\beta}(\lambda_2,0))\right]^{-1}\right)\left(\hat{G}(\tilde{\beta}(\lambda_2,0))'W_n\hat{G}(\tilde{\beta}(\lambda_2,0))\right)^{1/2}\times n^{-1/2}(\tilde{\beta}(\lambda_2,0)-\beta_{A,0})\right]. \qquad (88)
\end{aligned}$$

The last term on the right hand side of (88) can be rewritten as (by (37))

$$\begin{aligned}
&\left\{\lambda_2\left[\hat{G}(\tilde{\beta}(\lambda_2,0))'W_n\hat{G}(\tilde{\beta}(\lambda_2,0))\right]^{-1}+I\right\}\times\left[\hat{G}(\tilde{\beta}(\lambda_2,0))'W_n\hat{G}(\tilde{\beta}(\lambda_2,0))\right]^{1/2}n^{-1/2}\left[\tilde{\beta}(\lambda_2,0)-\beta_{A,0}\right]\\
&=-\lambda_2 n^{-1/2}\left[\hat{G}(\tilde{\beta}(\lambda_2,0))'W_n\hat{G}(\tilde{\beta}(\lambda_2,0))\right]^{-1/2}\beta_{A,0}-n^{-1/2}\left[\hat{G}(\tilde{\beta}(\lambda_2,0))'W_n\hat{G}(\tilde{\beta}(\lambda_2,0))\right]^{-1/2}\times\left[\hat{G}(\tilde{\beta}(\lambda_2,0))'W_n g_n(\beta_{A,0})\right]+o_P(1), \qquad (89)
\end{aligned}$$

where $g_n(\beta_{A,0})=\sum_{i=1}^n g_i(\beta_{A,0})$. Note that in the above equation we have an asymptotically negligible term due to using (37), where we have $\beta^*$ instead of $\tilde{\beta}(\lambda_2,0)$.

Via Theorem 3, and also using (87), with probability tending to one, zn = T1 + T2 + T3, where

$$T_1=\left[\delta'\left(I+\lambda_2\left[\hat{G}(\tilde{\beta}(\lambda_2,0))'W_n\hat{G}(\tilde{\beta}(\lambda_2,0))\right]^{-1}\right)\left[\hat{G}(\tilde{\beta}(\lambda_2,0))'W_n\hat{G}(\tilde{\beta}(\lambda_2,0))\right]^{1/2}\times n^{-1/2}\frac{\lambda_2\beta_{A,0}}{n+\lambda_2}\right]-\left[\delta'\lambda_2 n^{-1/2}\left[\hat{G}(\tilde{\beta}(\lambda_2,0))'W_n\hat{G}(\tilde{\beta}(\lambda_2,0))\right]^{-1/2}\beta_{A,0}\right],$$
$$T_2=\delta'\left[I+\lambda_2\left[\hat{G}(\tilde{\beta}(\lambda_2,0))'W_n\hat{G}(\tilde{\beta}(\lambda_2,0))\right]^{-1}\right]\left[\hat{G}(\tilde{\beta}(\lambda_2,0))'W_n\hat{G}(\tilde{\beta}(\lambda_2,0))\right]^{1/2}\times n^{-1/2}(\tilde{\beta}-\tilde{\beta}(\lambda_2,0)),$$
$$T_3=-\delta'\left[\hat{G}(\tilde{\beta}(\lambda_2,0))'W_n\hat{G}(\tilde{\beta}(\lambda_2,0))\right]^{-1/2}\left[\hat{G}(\tilde{\beta}(\lambda_2,0))'W_n\sum_{i=1}^n g_i(\beta_{A,0})\,n^{-1/2}\right].$$

Consider $T_1$; use Assumptions 1, 2, and 5 with $W_n=\hat{\Omega}^{-1}$, $W=\Omega^{-1}$:

$$\begin{aligned}
T_1^2&\le\frac{2}{n}\left\|\left(I+\lambda_2\left(G(\beta_{A,0})'WG(\beta_{A,0})\right)^{-1}\right)\left(G(\beta_{A,0})'WG(\beta_{A,0})\right)^{1/2}\frac{\lambda_2\beta_{A,0}}{n+\lambda_2}\right\|_2^2+\frac{2}{n}\left\|\lambda_2\left(G(\beta_{A,0})'WG(\beta_{A,0})\right)^{-1/2}\beta_{A,0}\right\|_2^2+o_P(1)\\
&\le\frac{2}{n}\frac{\lambda_2^2 Bn}{(n+\lambda_2)^2}\left(1+\frac{\lambda_2}{bn}\right)^2\|\beta_{A,0}\|_2^2+\frac{2}{n}\lambda_2^2\|\beta_{A,0}\|_2^2\frac{1}{bn^2}+o_P(1)=o_P(1),
\end{aligned}$$

via $\lambda_2=o(n)$, $\|\beta_{A,0}\|_2^2=O(p)$, and $\lambda_2^2\|\beta_{A,0}\|_2^2/n^2\to 0$. Consider $T_2$; similarly to the above analysis, and using (73),

$$T_2^2\le\frac{1}{n}\left(1+\frac{\lambda_2}{bn}\right)^2(Bn)\|\tilde{\beta}-\tilde{\beta}(\lambda_2,0)\|_2^2\le B\left(1+\frac{\lambda_2}{bn}\right)^2\left[\frac{\lambda_1\hat{\eta}^{-\gamma}\sqrt{p}}{n^2 b+\lambda_2+o_P(n^2)}\right]^2=o_P(1),$$

by (85) and λ2 = o(n).

For the term $T_3$ we use the Liapunov central limit theorem. By Assumptions 1 and 2,

$$T_3=-\sum_{i=1}^n\delta'\left[\hat{G}(\tilde{\beta}(\lambda_2,0))'\hat{\Omega}^{-1}\hat{G}(\tilde{\beta}(\lambda_2,0))\right]^{-1/2}\left[\hat{G}(\tilde{\beta}(\lambda_2,0))'\hat{\Omega}^{-1}\right]g_i(\beta_{A,0})\,n^{-1/2}=-\sum_{i=1}^n\delta'\left[G(\beta_{A,0})'\Omega^{-1}G(\beta_{A,0})\right]^{-1/2}\left[G(\beta_{A,0})'\Omega^{-1}\right]g_i(\beta_{A,0})\,n^{-1/2}+o_P(1).$$

Next set $R_i=\delta'[G(\beta_{A,0})'\Omega^{-1}G(\beta_{A,0})]^{-1/2}[G(\beta_{A,0})'\Omega^{-1}]g_i(\beta_{A,0})n^{-1/2}$. So

$$\begin{aligned}
\sum_{i=1}^n ER_i^2&=n^{-1}\sum_{i=1}^n E\left[\delta'\left[G(\beta_{A,0})'\Omega^{-1}G(\beta_{A,0})\right]^{-1/2}G(\beta_{A,0})'\Omega^{-1}g_i(\beta_{A,0})g_i(\beta_{A,0})'\Omega^{-1}G(\beta_{A,0})\left[G(\beta_{A,0})'\Omega^{-1}G(\beta_{A,0})\right]^{-1/2}\delta\right]\\
&=\delta'\left[G(\beta_{A,0})'\Omega^{-1}G(\beta_{A,0})\right]^{-1/2}G(\beta_{A,0})'\Omega^{-1}\left[n^{-1}\sum_{i=1}^n E g_i(\beta_{A,0})g_i(\beta_{A,0})'\right]\Omega^{-1}G(\beta_{A,0})\left[G(\beta_{A,0})'\Omega^{-1}G(\beta_{A,0})\right]^{-1/2}\delta.
\end{aligned}$$

Then use $\|n^{-1}\sum_{i=1}^n E g_i(\beta_{A,0})g_i(\beta_{A,0})'-\Omega\|_2^2\to 0$. Taking the limit of the term above,

$$\lim_{n\to\infty}\sum_{i=1}^n ER_i^2=\delta'\left[G(\beta_{A,0})'\Omega^{-1}G(\beta_{A,0})\right]^{-1/2}G(\beta_{A,0})'\Omega^{-1}\Omega\Omega^{-1}G(\beta_{A,0})\left[G(\beta_{A,0})'\Omega^{-1}G(\beta_{A,0})\right]^{-1/2}\delta=\delta'\delta=1.$$

Next we show $\sum_{i=1}^n E|R_i|^{2+l}\to 0$. See that, by (6), (7), and Assumption 4,

$$\sum_{i=1}^n\frac{1}{n^{1+l/2}}E\left|\delta'\left(G(\beta_{A,0})'\Omega^{-1}G(\beta_{A,0})\right)^{-1/2}G(\beta_{A,0})'\Omega^{-1}g_i(\beta_{A,0})\right|^{2+l}\le\left[\frac{B}{b}\right]^{1+l/2}\frac{1}{n^{1+l/2}}\sum_{i=1}^n E\left\|\Omega^{-1/2}g_i(\beta_{A,0})\right\|_{2+l}^{2+l}\to 0.$$

So $T_3\xrightarrow{d}N(0,1)$. The desired result then follows from $z_n=T_1+T_2+T_3$ with probability approaching one.
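As an informal illustration of the Liapunov central limit step for $T_3$, the following Monte Carlo sketch simulates the standardized GMM score $\delta'(G'\Omega^{-1}G)^{-1/2}G'\Omega^{-1}\sum_i g_i/\sqrt{n}$ in a toy design with independent moments and checks that its simulated mean and variance are close to 0 and 1. Every object in the code (the Jacobian $G$, the weight $\Omega$, the direction $\delta$, the sample sizes) is an illustrative assumption, not part of the paper's setup.

```python
import numpy as np

# Monte Carlo sketch of the Liapunov CLT step for T3: the standardized score
#   delta' (G' Omega^{-1} G)^{-1/2} G' Omega^{-1} (sum_i g_i) / sqrt(n)
# should be approximately N(0,1).  Here g_i = z_i * e_i with z_i ~ N(0, I_q)
# and e_i ~ N(0,1) independent, so Omega = I_q; G is an arbitrary full-rank
# Jacobian chosen only for illustration.
rng = np.random.default_rng(2)
n, p, q, reps = 400, 2, 3, 2000
G = rng.standard_normal((q, p))
Omega = np.eye(q)
delta = np.array([1.0, 0.0])                 # a unit-norm direction

S = G.T @ np.linalg.inv(Omega) @ G
A = np.linalg.inv(np.linalg.cholesky(S))     # A satisfies A S A' = I,
                                             # an inverse square root of S

stats = np.empty(reps)
for r in range(reps):
    Z = rng.standard_normal((n, q))
    e = rng.standard_normal(n)
    g_sum = Z.T @ e                          # sum_i g_i = sum_i z_i e_i
    stats[r] = delta @ A @ G.T @ np.linalg.inv(Omega) @ g_sum / np.sqrt(n)

print(stats.mean(), stats.var())             # close to 0 and 1
```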

Q.E.D.

Footnotes

1

There are more instruments used by Chen and Ludvigson (2009), but only these three are available to us. We also thank Sydney Ludvigson for reminding us of the discrepancy between the unused instruments on her website and those on the Journal of Applied Econometrics website.

Contributor Information

Mehmet Caner, Department of Economics, 4168 Nelson Hall, North Carolina State University, Raleigh, NC 27518.

Hao Helen Zhang, Department of Mathematics, University of Arizona, Tucson, AZ 85718, hzhang@math.arizona.edu; and Department of Statistics, North Carolina State University, Raleigh, NC 27695, hzhang@stat.ncsu.edu.

References

1. Abadir KM, Magnus JR. Matrix Algebra. Cambridge University Press; 2005.
2. Ai C, Chen X. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica. 2003;71:1795–1843.
3. Alfaro L, Kalemli-Ozcan S, Volosovych V. Why doesn't capital flow from rich to poor countries? An empirical investigation. Review of Economics and Statistics. 2008;90:347–368.
4. Andrews DWK, Lu B. Consistent model and moment selection procedures for GMM estimation with application to dynamic panel data models. Journal of Econometrics. 2001;101:123–165.
5. Caner M. Lasso-type GMM estimator. Econometric Theory. 2009;25:270–290.
6. Chen X. Large sample sieve estimation of semi-nonparametric models. Handbook of Econometrics. 2007;Vol. 6B:5550–5623. Chapter 76.
7. Chen X, Ludvigson SC. Land of addicts? An empirical investigation of habit-based asset pricing models. Journal of Applied Econometrics. 2009;24:1057–1093.
8. Davidson J. Stochastic Limit Theory. Oxford University Press; 1994.
9. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
10. Fan J, Peng H. Nonconcave penalized likelihood with a diverging number of parameters. Annals of Statistics. 2004;32:928–961.
11. Gao X, Huang J. Asymptotic analysis of high-dimensional LAD regression with lasso. Statistica Sinica. 2010;20:1485–1506.
12. Guggenberger P, Smith RJ. Generalized empirical likelihood tests in time series models with potential identification failure. Journal of Econometrics. 2008;142:134–161.
13. Han C, Phillips PCB. GMM with many moment conditions. Econometrica. 2006;74:147–192.
14. Hansen LP. Large sample properties of generalized method of moments estimators. Econometrica. 1982;50:1029–1054.
15. He X, Shao QM. On parameters of increasing dimensions. Journal of Multivariate Analysis. 2000;75:120–135.
16. Huang J, Horowitz J, Ma S. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Annals of Statistics. 2008;36:587–613.
17. Huber P. Robust regression: Asymptotics, conjectures and Monte Carlo. Annals of Statistics. 1973;1:799–821.
18. Knight K, Fu W. Asymptotics for lasso-type estimators. Annals of Statistics. 2000;28:1356–1378.
19. Knight K. Shrinkage estimation for nearly-singular designs. Econometric Theory. 2008;24:323–338.
20. Lam C, Fan J. Profile-kernel likelihood inference with diverging number of parameters. Annals of Statistics. 2008;36:2232–2260. doi:10.1214/07-AOS544.
21. Leeb H, Pötscher B. Model selection and inference: Facts and fiction. Econometric Theory. 2005;21:21–59.
22. Liao Z. Adaptive GMM shrinkage estimation with consistent moment selection. Working paper, Department of Economics, Yale University; 2011.
23. Newey W, Powell J. Instrumental variable estimation of nonparametric models. Econometrica. 2003;71:1557–1569.
24. Newey W, Windmeijer F. GMM with many weak moment conditions. Econometrica. 2009a;77:687–719.
25. Newey W, Windmeijer F. Supplement to "GMM with many weak moment conditions." Econometrica. 2009b;77:687–719.
26. Otsu T. Generalized empirical likelihood inference for nonlinear and time series models under weak identification. Econometric Theory. 2006;22:513–527.
27. Portnoy S. Asymptotic behavior of M-estimators of p regression parameters when p^2/n is large. I. Consistency. Annals of Statistics. 1984;12:1298–1309.
28. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
29. Wang H, Li B, Leng C. Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society, Series B. 2009;71:671–683.
30. Wang H, Li R, Tsai C. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007;94:553–568. doi:10.1093/biomet/asm053.
31. Zhang HH, Lu W. Adaptive Lasso for Cox's proportional hazards model. Biometrika. 2007;94:691–703.
32. Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
33. Zou H, Hastie T, Tibshirani R. On the degrees of freedom of the lasso. Annals of Statistics. 2007;35:2173–2192.
34. Zou H, Zhang HH. On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics. 2009;37:1733–1751. doi:10.1214/08-AOS625.
