Abstract
In this paper, we develop procedures to construct simultaneous confidence bands for potentially infinite-dimensional parameters after model selection for general moment condition models where is potentially much larger than the sample size of available data, n. This allows us to cover settings with functional response data where each of the parameters is a function. The procedure is based on the construction of score functions that satisfy Neyman orthogonality condition approximately. The proposed simultaneous confidence bands rely on uniform central limit theorems for high-dimensional vectors (and not on Donsker arguments as we allow for ). To construct the bands, we employ a multiplier bootstrap procedure which is computationally efficient as it only involves resampling the estimated score functions (and does not require resolving the high-dimensional optimization problems). We formally apply the general theory to inference on regression coefficient process in the distribution regression model with a logistic link, where two implementations are analyzed in detail. Simulations and an application to real data are provided to help illustrate the applicability of the results.
Keywords: Inference after model selection, moment condition models with a continuum of target parameters, Lasso and Post-Lasso with functional response data
1. Introduction.
High-dimensional models have become increasingly popular in the last two decades. Much research has been conducted on estimation of these models. However, inference about parameters in these models is much less understood, although the literature on inference is growing quickly; see the list of references below. In this paper, we construct simultaneous confidence bands for many functional parameters parameters in a very general framework of moment condition models, where each parameter itself can be an infinite-dimensional object, and the number of such parameters can be much larger than the sample size of available data. Our paper builds upon [11], where simultaneous confidence bands have been constructed for many scalar parameters in a high-dimensional sparse z-estimation framework.
As a substantive application, we apply our general results to provide simultaneous confidence bands for parameters in a logistic regression model with functional response data
(1.1) |
where is a -vector of covariates whose effects are of interest, X = (X1,…,Xp)′ is a p-vector of controls, is the logistic link function, is a set of indices and for each , for some constants and the response variable Y, is a vector of target parameters and βu = (βu1,…,βup)′ is a vector of nuisance parameters. Here, both and p are allowed to be potentially much larger than the sample size n, and we have functional target parameters and p functional nuisance parameters . This example is important because it demonstrates that our methods can be used for inference about the whole distribution of the response variable Y given D and X in a high-dimensional setting, and not only about some particular features of it such as mean or median. This model is called a distribution regression model in [22] and a conditional transformation model in [26], who argue that the model provides a rich class of models for conditional distributions, and offers a useful generalization of traditional proportional hazard models as well as a useful alternative to quantile regression. We develop inference methods to construct simultaneous confidence bands for many functional parameters of this model in Section 3.
Toward this goal, our contributions include to effectively estimate a continuum of high-dimensional nuisance parameters, allow for approximately sparse models, control sparse eigenvalues of a continuum of random matrices, establish an approximate linearization for a collection of “orthogonalized” (or “de-biased”) estimators and establish the validity of a multiplier bootstrap for the construction of confidence bands for the many functional parameters of interest based on these estimators. In particular, these contributions build upon but go much beyond [11] (Corollary 4), which considers the special case of many scalar parameters in a z-estimation framework, and beyond [19] (Theorem 5.1), where simultaneous confidence bands are constructed via multiplier bootstrap for any large collection of approximately linear scalar estimator , j = 1,…,p, where the ℓ∞-norm of rn is .
Our general results refer to the problem of estimating the set of parameters in the moment condition model,
(1.2) |
where W is a random element that takes values in a measurable space according to a probability measure P, and are sets of indices, and for each and , ψuj is a known score function, θuj is a scalar parameter of interest and ηuj is a potentially high-dimensional (or infinite-dimensional) nuisance parameter. Assuming that a random sample of size n, , from the distribution of W is available together with suitable estimators of ηuj, we aim to construct simultaneous confidence bands for that are valid uniformly over a large class of probability measures P, say . Specifically, for each and , we construct an appropriate estimator of θuj along with an estimator of the standard deviation of , such that
(1.3) |
uniformly over , where α ∈ (0, 1) and cα is an appropriate critical value, which we choose to construct using a multiplier bootstrap method. The left- and the right-hand sides of the inequalities inside the probability statement (1.3) then can be used as bounds in simultaneous confidence bands for θuj’s. In this paper, we are particularly interested in the case when is potentially much larger than n and is an uncountable subset of , so that for each , is an infinite-dimensional (i.e., functional) parameter.
In the presence of high-dimensional nuisance parameters, construction of valid confidence bands is delicate. Dealing with high-dimensional parameters requires relying upon regularization that leads to lack of asymptotic linearization of the estimators of target parameters since regularized estimators of nuisance parameters suffer from a substantial bias and this bias spreads into the estimators of the target parameters. This lack of asymptotic linearization in turn typically translates into severe distortions in coverage probability of the confidence bands constructed by traditional techniques that are based on perfect model selection; see [30–32, 40]. To deal with this problem, we assume that the score functions ψuj are constructed to satisfy a near-orthogonality condition that makes them immune to first-order changes in the value of the nuisance parameter, namely
(1.4) |
for all in an appropriate set where ∂r denotes the derivative with respect to r. We shall often refer to this condition as Neyman orthogonality, since in lowdimensional parametric settings the orthogonality property originates in the work of Neyman on the C(α) test in the 50s. In Section 2 below, we describe a few general methods for constructing the score functions ψuj obeying the Neyman orthogonality condition.
The Neyman orthogonality condition (1.4) is important because it helps to make sure that the bias from the estimators of the high-dimensional nuisance parameters does not spread into the estimators of the target parameters. In particular, under (1.4), it follows that
at least up to the first order, where the index W in EP,W [·] means that the expectation is taken over W only. This makes the estimators of the target parameters θuj immune to the bias in the estimators , which in turn improves their statistical properties and opens up the possibilities for valid inference.
As the framework (1.2) covers a broad variety of applications, it is instructive to revisit the logistic regression model with functional response data (1.1). To construct score functions ψuj that satisfy both the moment conditions (1.2) and the Neyman orthogonality condition (1.4) in this example, for and , define a -vector of additional nuisance parameters
(1.5) |
where , and
(1.6) |
Then, denoting W = (Y,D,X) and splitting θu into θuj and , we set
where and . It is straightforward to see that these score functions ψuj satisfy the moment conditions (1.2) and to see that they also satisfy the Neyman orthogonality condition (1.4), observe that
where the first line by definition of and since VarP (Yu | D, X) = Λ′ (D′ θu + X′ βu), and the second by (1.1). Because of this orthogonality condition, we can exploit the moment conditions (1.2) to construct regular, -consistent, estimators of θuj even if nonregular, regularized or post-regularized, estimators of are used to cope with high-dimensionality. Using these regular estimators of θuj, we then can construct valid confidence bands (1.3).
Our general approach to construct simultaneous confidence bands, which is developed in Section 2, can be described as follows. First, we construct the moment conditions (1.2) that satisfy the Neyman orthogonality condition (1.4), and use these moment conditions to construct estimators of θuj for all and . Second, under appropriate regularity conditions, we establish a Bahadur representation for ’s. Third, employing the Bahadur representation, we are able to derive a suitable Gaussian approximation for the distribution of ’s. Importantly, the Gaussian approximation is possible even if both and the dimension of the index set , du, are allowed to grow with n, and asymptotically remains much larger than n. Finally, from the Gaussian approximation, we construct simultaneous confidence bands using a multiplier bootstrap method. Here, the Gaussian and bootstrap approximations are constructed by applying the results on highdimensional central limit and bootstrap theorems established in [16–21] by verifying the conditions there.
Although regularity conditions underlying our approach can be verified for many models defined by moment conditions, for illustration purposes, we explicitly verify these conditions for the logistic regression model with functional response data (1.1) in Section 3. We also note that the regularity conditions, in particular those related to the entropy of the nuisance parameter estimators, can be substantially relaxed if we use sample splitting, so that the nuisance parameters and parameters of interest are estimated on separate samples; see [15]. In addition, we examine the performance of the proposed procedures in a Monte Carlo simulation study and provide an example based on real data in Section 5. Moreover, in the Supplementary Material [5], we discuss the construction of simultaneous confidence bands based on a double-selection estimator. This estimator does not require to explicitly construct the score functions satisfying the Neyman orthogonality condition but nonetheless is first-order equivalent to the estimator based on such functions.
We also develop new results for ℓ1-penalized M-estimators in Section 4 to handle functional data and criterion functions that depend on nuisance functions for which only estimates are available building on ideas in [3,4,12] (for brevity of the paper, generic results are deferred to Supplementary Material, and Section 4 only contains results that are relevant for the logistic regression model studied in Section 3). Specifically, we develop a method to select penalty parameters for these estimators and extend the existing theory to cover functional data to achieve rates of convergence and sparsity guarantees that hold uniformly over . The ability to allow both for functional data and for nuisance functions is crucial in the implementation and in theoretical analysis of the methods proposed in this paper.
Orthogonality conditions like that in (1.4) have played an important role in statistics and econometrics. In low-dimensional settings, a similar condition was used by Neyman in [37] and [38] while in semiparametric models the orthogonality conditions were used in [1, 35, 36, 41] and [33]. In high-dimensional settings, [7] and [2] were the first to use the orthogonality condition (1.4) in a linear instrumental variables model with many instruments. Related ideas have also been used in the literature to construct confidence bands in high-dimensional linear models, generalized linear models and other nonlinear models; see [6, 8–11, 13, 27,28,43, 46, 47] and [39], where we can interpret each procedure as implicitly or explicitly constructing and solving an approximately Neyman-orthogonal estimating equation. We contribute to this quickly growing literature by providing procedures to construct simultaneous confidence bands for many infinite-dimensional parameters identified by moment conditions.
Throughout the paper, we use the standard notation from the empirical process theory. In particular, we use to denote the expectation with respect to the empirical measure associated with the data , and we use to denote the empirical process . More details about the notation are given in the Supplementary Material.
2. Confidence regions for function-valued parameters based on moment conditions.
2.1. Generic construction of confidence regions.
In this section, we state our results under high-level conditions. In the next section, we will apply these results to construct simultaneous confidence bands for many infinite-dimensional parameters in the logistic regression model with functional response data.
Recall that we are interested in constructing simultaneous confidence bands for a set of target parameters where for each and , the parameter θuj satisfies the moment condition (1.2) with ηuj being a potentially high-dimensional (or infinite-dimensional) nuisance parameter. Assume that θuj ∈ Θuj, a finite or infinite interval in , and that ηuj ∈ Tuj, a convex set in a normed space equipped with a norm ∥ · ∥e. We allow to be a possibly uncountable set of indices, and to be potentially large.
We assume that a random sample from the distribution of W is available for constructing the confidence bands. We also assume that for each and , the nuisance parameter ηuj can be estimated by using the same data . In the next section, we discuss examples where ’s are based on Lasso or Post-Lasso methods (although other modern regularization and postregularization methods can be applied). Our confidence bands will be based on the estimators of θuj that are for each and defined as approximate ϵn-solutions in Θuj to sample analogs of the moment conditions (1.2), that is,
(2.1) |
where ϵn = o(δnn−1/2) for all n ≥ 1 and some sequence (δn)n≥1 of positive constants converging to zero.
To motivate the construction of the confidence bands based on the estimators , we first study distributional properties of these estimators. To do that, we will employ the following regularity conditions. Let C0 be a strictly positive (and finite) constant, and for each and , let be some subset of Tuj, whose properties are specified below in assumptions. In particular, we will choose the sets so that, on the one hand, their complexity does not grow too fast with n but, on the other hand, for each and , the estimator takes values in with high probability. As discussed before, we rely on the following nearorthogonality condition.
DEFINITION 2.1 (Near-orthogonality condition). For each and , we say that ψuj obeys the near-orthogonality condition with respect to if the following conditions hold: The Gateaux derivative map
exists for all and and (nearly) vanishes at , namely,
(2.2) |
for all .
At the end of this section, we describe several methods to obtain score functions ψuj that obey the near-orthogonality condition. Together these methods cover a wide variety of applications.
Let ω and c0 be some strictly positive (and finite) constants, and let n0 ≥ 3 be some positive integer. Also, let (B1n)n≥1 and (B2n)n≥1 be some sequences of positive constants, possibly growing to infinity, where B1n ≥ 1 for all n ≥ 1. In addition, denote
(2.3) |
The quantity measures how rich the process is. The quantity Juj measures the degree of identifiability of θuj by the moment condition (1.2). In many applications, it is bounded in absolute value from above and away from zero. Finally, let be a set of probability measures P of possible distributions of W on the measurable space .
We collect our main conditions on the score functions ψuj and the true values of the target parameters θuj in the following assumption.
Assumption 2.1 (Moment condition problem). For all n ≥ n0, , , and , the following conditions hold: (i) The true parameter value θuj obeys (1.2), and Θuj contains a ball of radius centered at θuj. (ii) The map (θ,η) ↦ EP[ψuj(W,θ,η)] is twice continuously Gateaux-differentiable on . (iii) The score function ψuj obeys the nearorthogonality condition given in Definition 2.1 for the set . (iv) For all θ ∈ Θuj, |EP[ψuj(W,θ,ηuj)]| ≥ 2−1|Juj(θ - θuj)| ∧ c0, where Juj satisfies c0 ≤ |Juj | ≤ C0. (v) For all r ∈ [0, 1), θ ∈ Θuj, and η ∈ Tuj:
,
,
.
Assumption 2.1 is mild and standard in moment condition problems. Assumption 2.1 (i) requires θuj to be sufficiently separated from the boundary of Θuj. Assumption 2.1(ii) requires that the functions (θ,η) ↦ EP[ψuj(W,θ,η)] are smooth. It is a mild condition because it does not require smoothness of the functions (θ, η) ↦ ψuj(W,θ,η). Assumption 2.1(iii) is our key condition and is discussed above. Assumption 2.1(iv) implies sufficient identifiability of θuj. In particular, it implies that the equation EP[ψuj(W,θ,ηuj)] = 0 has only one solution θ = θuj. If this equation has multiple solutions, Assumption 2.1(iv) implies that the set Θuj is restricted enough so that there is only one solution in Θuj. Assumption 2.1(v-a) means that the functions (θ, η) ↦ ψuj(W,θ,η) mapping Θuj × Tuj into L2(P) are Lipschitz-continuous at (θ, η) = (θuj,ηuj) with Lipschitz order ω/2. In most applications, we can set ω = 2. Assumptions 2.1(v-b,v-c) impose smoothness bounds on the functions (θ, η) ↦ EP[ψuj(W,θ,η)].
Next, we state our conditions related to the estimators . Let (Δn)n≥1 and (τn)n≥1 be some sequences of positive constants converging to zero. Also, let (an)n≥1, (vn)n≥1, and (Kn)n≥1 be some sequences of positive constants, possibly growing to infinity, where an ≥ n ∨ Kn and vn ≥ 1 for all n ≥ 1. Finally, let q ≥ 2 be some constant.
Assumption 2.2 (Estimation of nuisance parameters). For all n ≥ n0 and , the following conditions hold: (i) With probability at least 1 − Δn, we have for all and . (ii) For all , , and , ∥η − ηuj ∥e ≤ τn. (iii) For all and , we have . (iv) The function class is suitably measurable and its uniform entropy numbers obey
(2.4) |
where F1 is a measurable envelope for that satisfies ∥F1∥P,q ≤ Kn. (v) For all , we have c0 ≤ ∥ f ∥P,2 ≤ C0. (vi) The complexity characteristics an and vn satisfy:
,
,
.
Assumption 2.2 provides sufficient conditions for the estimation of the nuisance parameters . Assumption 2.2 (i) requires that the set is large enough so that with high probability. Assumptions 2.2 (i,ii) together require that the estimator converges to ηuj with the rate τn. This rate should be fast enough so that Assumptions 2.2(vi-b,vi-c) are satisfied. Assumption 2.2(iv) gives a bound on the complexity of the set expressed via uniform entropy numbers, and Assumptions 2.2(vi-a,vi-b) require that the set is small enough so that its complexity does not grow too fast. Assumption 2.2(v) requires that the functions (θ, η) ↦ ψuj(W,θ,η) are scaled properly. Suitable measurability of , required in Assumption 2.2(iv), is a mild condition that is satisfied in most practical cases; see the Supplementary Material and [25] for clarifications. Overall, Assumption 2.2 shows the trade-off in the choice of the sets : setting large, on the one hand, makes it easy to satisfy Assumption 2.2(i) but, on the other hand, yields large values of an and vn in Assumption 2.2(iv) making it difficult to satisfy Assumption 2.2(vi).
We stress that the class does not need to be Donsker because its uniform entropy numbers are allowed to increase with n. This is important because allowing for non-Donsker classes is necessary to deal with high-dimensional nuisance parameters. Note also that our conditions are very different from the conditions imposed in various settings with nonparametrically estimated nuisance functions; see, for example, [44, 45] and [29].
In addition, we emphasize that the conditions stated in Assumption 2.2 are sufficient for our results for the general model (1.2) but can often be relaxed if the structure of the functions ψuj(W,θ,η) is known. For example, it is possible to relax Assumption 2.2(vi) if the functions ψuj(W,θ,η) are linear in θ, which happens in the linear regression model with θ being the coefficient on the covariate of interest; see [9]. Moreover, it is possible to relax the entropy condition (2.4) of Assumption 2.2 by relying upon sample splitting, where part of the data is used to estimate ηuj, and the other part is used to estimate θuj given an estimate of ηuj; see [2] and [15]. By swapping the role of two parts, and averaging the resulting two estimators, we do not incur any efficiency losses.
The following theorem is our first main result in this paper.
Theorem 2.1 (Uniform Bahadur representation). Under Assumptions 2.1 and 2.2, for an estimator that obeys (2.1), we have
(2.5) |
in uniformly over , where and .
Comment 2.1 (On the proof ofTheorem 2.1). To prove this theorem, we use the following identity:
(2.6) |
(2.7) |
Here, the term on the right-hand side of (2.6) is the main term on the right-hand side of (2.5), up to a normalization (σuj Juj)−l. Also, we show that the first term in (2.7) is OP (δn) since satisfies (2.1). Moreover, using arather standard theory of Z-estimators, we show that . This in turn allows us to show with the help of empirical process arguments that the difference of the last two terms in (2.7) is OP (δn) as well. (In [15], we also point out that this difference is OP (δn) under much weaker entropy conditions than those in Assumption 2.2 if and are obtained using separate samples.) Thus, it remains to show that the left-hand side of (2.6) is equal to the left-hand side of (2.5) up to an approximation error OP (δn) and up to a normalization (σui juj)−1. To do so, we use second-order Taylor’s expansion of the function
at r = 1 around r = 0. This gives
for some . Here, follows from Assumptions 2.1 and 2.2 and the key near-orthogonality condition also allows us to show that . Without this condition, the term would give first-order bias and lead to slower-than- rate of convergence of the estimator . Finally, again using the empirical process arguments, we can show that all the bounds including the term OP (δn) hold uniformly over and .
Comment 2.2 (On uniformity in u in Theorem 2.1). When the functions are Lipschitz-continuous, one can use a simple discretization argument to conclude that the approximation in (2.5) holds uniformly over as long as we can show that it holds for each . However, in many applications, including the distribution regression model discussed in Section 3, this function is actually not continuous, and the location of jumps depends on the data. Therefore, we have to rely on a more complicated argument to establish uniformity in u in the bound (2.5).
The uniform Bahadur representation derived in Theorem 2.1 is useful for the construction of simultaneous confidence bands for as in (1.3). For this purpose, we apply new high-dimensional central limit and bootstrap theorems that have been recently developed in a sequence of papers [16, 18–20] and [21]. To apply these theorems, we make use of the following regularity condition.
Let be a sequence of positive constants converging to zero. Also, let (ϱn)n≥1, , (An)n≥1, , and (Ln)n≥1 be some sequences of positive constants, possibly growing to infinity, where ϱn ≥ 1, An ≥ n, and for all n ≥ 1. In addition, from now on, we assume that q > 4. Denote by an estimator of , with and being suitable estimators of Juj and σuj.
Assumption 2.3 (Additional score regularity). For all n ≥ n0 and , the following conditions hold: (i) The function class is suitably measurable and its uniform entropy numbers obey
where F0 is a measurable envelope for that satisfies ∥F0∥P,q ≤ Ln. (ii) For all and k = 3,4, we have . (iii) The function class satisfies with probability for all 0 < ϵ ≤ 1 and for all .
This assumption is technical, and its verification in applications is rather standard. For the Gaussian approximation result below, we actually only need the first and the second part of this assumption. The third part will be needed for establishing validity of the simultaneous confidence bands based on the multiplier bootstrap procedure. As a side note, observe that Assumption 2.3 allows to bound defined in (2.3) and used in Assumptions 2.1 and 2.2; see Appendix G of the Supplementary Material.
Next, let denote a tight zero-mean Gaussian process indexed by with covariance operator given by for and . We have the following corollary of Theorem 2.1, which is our second main result in this paper.
Corollary 2.1 (Gaussian approximation). Suppose that Assumptions 2.1, 2.2 and 2.3(i,ii) hold. In addition, suppose that thefollowing growth conditions are satisfied: , and . Then
uniformly over .
Based on Corollary 2.1, we are now able to construct simultaneous confidence bands for θuj’s as in (1.3). In particular, we will use the Gaussian multiplier bootstrap method employing the estimates of . To describe the method, define the process
(2.8) |
where are independent standard normal random variables which are independent from the data . Then the multiplier bootstrap critical value cα is defined as the (1 − α) quantile of the conditional distribution of given the data . To prove validity of this critical value for the construction of simultaneous confidence bands of the form (1.3), we will impose the following additional assumption. Let (εn)n≥1 be a sequence of positive constants converging to zero.
Assumption 2.4 (Variation estimation). For all n ≥ n0 and ,
The following corollary establishing validity of the multiplier bootstrap critical value ca for the simultaneous confidence bands construction is our third main result in this paper.
Corollary 2.2 (Simultaneous confidence bands). Suppose that Assumptions 2.1–2.4 hold. In addition, suppose that the growth conditions of Corollary 2.1 hold. Finally, suppose that εn ϱn log An = o(1), and . Then (1.3) holds uniformly over .
Comment 2.3 (Confidence bands based on other bootstrap schemes). Results in [24] suggest that the conditions of Corollary 2.2 can be somewhat relaxed if, instead of using the Gaussian weights in the multiplier bootstrap method, we use Mammen’s weights as in [34] or if we use the empirical bootstrap instead of the multiplier bootstrap. Since the results in [24] apply only to high-dimensional random vectors and do not apply to infinite-dimensional random processes, we leave a formal discussion of the results under these alternative bootstrap schemes to future work.
2.2. Construction of score functions satisfying the orthogonality condition.
Here, we discuss several methods for generating orthogonal scores in a wide variety of settings, including the classical Neyman’s construction. In what follows, since the argument applies to each u and j, it is convenient to omit the indices u and j and also to use the subscript 0 to indicate the true values of the parameters. For simplicity, we also focus the discussion on the exactly orthogonal case. With these simplifications, we can restate the orthogonality condition as follows: we say that the score ψ obeys the Neyman orthogonality condition with respect to if the following conditions hold: The Gateaux derivative map
exists for all and and vanishes at , namely,
(2.9) |
for all .
(1) Orthogonal scores for likelihood problems with finite-dimensional nuisance parameters. In likelihood settings with finite-dimensional parameters, the construction of orthogonal equations was proposed by Neyman [37] who used them in construction of his celebrated C(α)-statistic.1
To describe the construction, suppose that the log-likelihood function associated to observation W is (θ, β) ↦ ℓ(W, θ, β), where is the target parameter and is the nuisance parameter. Under regularity conditions, the true parameter values θ0 and β0 obey
(2.10) |
Now consider the new score function
(2.11) |
where the nuisance parameter is
μ is the d × p0 orthogonalization parameter matrix whose true value μ0 solves the equation
And
Provided that μ0 is well defined, we have by (2.10) that EP[ψ(W, θ0,η0)] = 0, where . Moreover, it is trivial to verify that under standard regularity conditions the score function ψ obeys the near orthogonality condition (2.2) exactly (i.e., with C0 = 0), that is,
Note that in this example, μ0 not only creates the necessary orthogonality but also creates the efficient score for inference on the main parameter θ, as emphasized by Neyman.
(2) Orthogonal scores for likelihood problems with infinite-dimensional nuisance parameters. The Neyman’s construction can be extended to semi-parametric models, where the nuisance parameter β is a function. In this case, the original score functions (θ, β) ↦ ∂θℓ(W, θ, β) corresponding to the log-likelihood function (θ, β) ↦ ℓ(W, θ, β) associated to observation W can be transformed into efficient score functions ψ that obey the exact orthogonality condition (2.9) by projecting the original score functions onto the orthocomplement of the tangent space induced by the nuisance parameter β; see Chapter 25 of [44] for a detailed description of this construction. Note that the projection may create additional nuisance parameters, so that the new nuisance parameter η could be of larger dimension than β. Other relevant references include [9, 29, 45] and [11]. The approach is related to Neyman’s construction in the sense that the score ψ arising in this model is actually the Neyman’s score arising in a one-dimensional least favorable parametric subfamily, [42]; see Chapter 25 of [44] for details.
(3) Orthogonal scores for conditional moment problems with infinite-dimensional nuisance parameters. Next, consider a conditional moment restrictions framework studied by Chamberlain [14]. To define the framework, let W, D and V be random vectors in , and , respectively, with D and V being subvectors of W, so that dD + dV ≤ dW. Also, let be a finite-dimensional parameter whose true value θ0 is of interest, and let be a vectorvalued functional nuisance parameter, with the true value being . The conditional moment restrictions framework assumes that θ0 and h0 satisfy the following equation:
(2.12) |
where is some known function. This framework is of interest because it covers an extremely rich variety of models, without having to explicitly rely on the likelihood formulation. For example, it covers the partial linear model
(2.13) |
where Y is a scalar dependent random variable, D is a scalar independent treatment random variable, V is a vector of control random variables and U is a scalar unobservable noise random variable. Indeed, (2.13) implies (2.12) by setting W = (Y, D,V′)′ and m(W, θ, h) = Y − Dθ − h.
Here, we would like to build a (generalized) score function (θ, η) ↦ ψ(W, θ, η) for estimating θ0, the true value of parameter θ, where η is a new nuisance parameter with true value η0, that obeys the near orthogonality condition (2.2). There are many ways to do so but one particularly useful way is the following. Consider the functional parameters and whose true values are given by
where
Then set η = (h, φ, Σ) and η0 = (h0,φ0,Σ0), and define the score function:
It is rather straightforward to verify that under mild regularity conditions, the score function ψ satisfies the moment condition, EP[ψ(W, θ0,η0)] = 0, and in addition, the orthogonality condition:
Note that this construction gives the efficient score function ψ that yields an estimator of θ0 achieving the semiparametric efficiency bound, as calculated by Chamberlain [14].
3. Application to logistic regression model with functional response data.
In this section, we apply our main results to a logistic regression model with functional response data described in the Introduction.
3.1. Model.
We consider a response variable that induces a functional response by for a set of indices and some constants . We are interested in the dependence of this functional response on a -vector of covariates, , controlling for a p-vector of additional covariates . We allow both and p to be (much) larger than the sample size of available data, n.
For each , we assume that Yu satisfies the generalized linear model with the logistic link function
(3.1) |
where is a vector of parameters that are of interest, βu = (βu1,…,βup)′ is a vector of nuisance parameters, ru = ru(D, X) is an approximation error, is the logistic link function defined by
and is the distribution of the triple W = (Y,D,X). As in the previous section, we construct simultaneous confidence bands for the parameters based on a random sample from the distribution of W = (Y, D,X).
3.2. Orthogonal score functions.
Before setting up score functions that satisfy both the moment conditions (1.2) and the orthogonality condition (1.4), observe that “naive” score functions that follow directly from the model (3.1),
where , satisfy the moment conditions EP[muj(W,θuj)] = 0 but violate the orthogonality condition (1.4) [with muj replacing ψuj and ]. To satisfy the orthogonality condition (1.4), we proceed using an approach from Section 2.2 as in the Introduction. Specifically, for each and , define the -vector of additional nuisance parameters by (1.5) where is defined in (1.6). Thus, by the first-order condition of (1.5), the nuisance parameters satisfy
(3.2) |
Also, denote . Then we set
where the nuisance parameter is . As we formally demonstrate in the proof of Theorem 3.1 below, this function satisfies the near-orthogonality condition (1.4).
3.3. Estimation using orthogonal score functions.
Next, we discuss estimation of ηuj’s and θuj’s. First, we assume that the approximation error ru = ru (D, X) is asymptotically negligible, so that it can be estimated by , the identically zero function of D and X. Second, for , we consider an estimator defined as a post-regularization weighted least squares estimator corresponding to the problem (1.5). Third, for , we consider a plug-in estimator , where and are suitable estimators of θu and βu. In particular, we assume that and are post-regularization maximum likelihood estimators corresponding to the log-likelihood function (θ, β) ↦ −Mu (W, θ, β) where
(3.3) |
The details of the estimators , and are given in Algorithm 1 below. The results in this paper can also be easily extended to the case where , and are replaced by penalized maximum likelihood estimators and and penalized weighted least squares estimator , respectively.
Then our estimator of ηuj is . Substituting this estimator into the score function ψuj gives
(3.4) |
which, using the sample analog (2.1) of the moment conditions (1.2), gives the following estimator of θuj:
(3.5) |
The algorithm is summarized as follows.
ALGORITHM 1. For each and :
3.4. Regularity conditions.
Next, we specify our regularity conditions. For all and , denote . Also, denote . Let q, c1 and C1 be some strictly positive (and finite) constants where q > 4. Moreover, let (δn)n≥1 and (Δn)n≥1 be some sequences of positive constants converging to zero. Finally, let (Mn,1)n≥1 and (Mn,2)n≥1 be some sequences of positive constants, possibly growing to infinity, where Mn,1 ≥ 1 and Mn,2 ≥ 1 for all n.
Assumption 3.1 (Parameters). For all , we have and . In addition, for all , we have . Finally, for all and , Θuj contains a ball of radius (log log n) (log an) 3/2/n1/2 centered at θuj.
Assumption 3.2 (Sparsity). There exist s = sn and , and , such that for all , and .
Assumption 3.3 (Distribution of Y). The conditional pdf of Y given (D, X) is bounded by C1.
Assumptions 3.1–3.3 are mild and standard in the literature. In particular, Assumption 3.1 requires the parameter spaces Θuj to be bounded, and also requires that for each and , the parameter θuj to be sufficiently separated from the boundaries of the parameter space Θuj. Assumption 3.2 requires approximate sparsity of the model (3.1). Note that in Assumption 3.2, given that ’s exist, we can and will assume without loss of generality that for some with |T| ≤ sn, where is allowed to depend on u and j. Here, the -vector is defined from by keeping all components of that are in T and setting all other components to be zero. Assumption 3.3 can be relaxed at the expense of more technicalities.
Assumption 3.4 (Covariates). For all , the following inequalities hold: (i) , (ii) , and (iii) . In addition, we have that (iv) , (v) (vi) , (vii) , (viii) and (ix) .
This assumption requires that there is no multicollinearity between covariates in vectors D and X. In addition, it requires that the constants and are chosen so that the probabilities of and are both nonvanishing since otherwise we would have vanishing either for u = 0 or u = 1 violating Assumption 3.4(i). Intuitively, sending and to the left and to the right tails of the distribution of Y, respectively, would blow up the variance of the estimators , given by in Theorem 2.1, and leading eventually to the estimators with slower-than- rate of convergence. Although our results could be extended to allow for the case where and are sent to the tails of the distribution of Y slowly, we skip this extension for the sake of clarity. Moreover, Assumption 3.4 imposes constraints on various moments of covariates. Since these constraints might be difficult to grasp, at the end of this section, in Corollary 3.3, we provide an example for which these constraints simplify into easily interpretable conditions.
Assumption 3.5 (Approximation error). For all , we have (i) , (ii) , (iii) , and (iv) almost surely. In addition, with probability , (v) .
This assumption requires the approximation error ru = ru(D, X) to be sufficiently small. Under Assumption 3.4, the first condition of Assumption 3.5 holds if the approximation error is such that almost surely for some constant C.
3.5. Formal results.
Under specified assumptions, our estimators satisfy the following uniform Bahadur representation theorem.
Theorem 3. 1 (Uniform Bahadur representation for logistic model). Suppose that Assumptions 3.1–3.5 hold for all . In addition, suppose that the following growth condition holds: . Then for the estimators satisfying (3.5), we have
(3.6) |
in uniformly over , where , , and Juj is defined in (2.3).
This theorem allows us to establish a Gaussian approximation result for the supremum of the process :
Corollary 3.1 (Gaussian approximation for logistic model). Suppose that Assumptions 3.1–3.5 hold for all . In addition, suppose that the following growth conditions hold: , and . Then
uniformly over , where is a tight zero-mean Gaussian process indexed by with the covariance given by for u, and j, .
Based on this corollary, we are now able to construct simultaneous confidence bands for the parameters θuj. Observe that
and so it can be estimated by
In addition, , and so it can be estimated by
Moreover, as in Section 2, define , and let cα be the (1 – α) quantile of the conditional distribution of given the data where the process is defined in (2.8). Then we have the following.
Corollary 3.2 (Simultaneous confidence bands for logistic model). Suppose that Assumptions 3.1–3.5 hold for all . In addition, suppose that the following growth conditions hold: , , and sn log3 an = o(n). Then (1.3) holds uniformly over .
To conclude this section, we provide an example for which conditions of Corollary 3.2 are easy to interpret. Recall that .
Corollary 3.3 (Uniform confidence bands for logistic regression model under simple conditions). Suppose that Assumptions 3.1–3.3, 3.4(i,ii,iv) and 3.5(i,ii,iv,v) hold for q > 4for all . In addition, suppose that and . Moreover, suppose that the following growth conditions hold: log7 an /n = o(1), and . Then (1.3) holds uniformly over .
Comment 3.1 (Estimation of variance). When constructing the confidence bands based on (1.3), we find in simulations that it is beneficial to replace the estimators of by max where is an alternative consistent estimator of .
Comment 3.2 (Alternative implementations, double selection). We note that the theory developed here is applicable for different estimators that construct the new score function with the desired orthogonality condition implicitly. For example, the double selection idea yields an implementation of an estimator that is first-order equivalent to the estimator based on the score function. The algorithm yielding the double selection estimator is as follows.
ALGORITHM 2. For each and :
Step 1′. Run post-ℓ1-penalized logistic estimator (4.2) of Yu on D and X to compute .
Step 2′. Define the weights .
Step 3′. Run the Lasso estimator (4.4) of on to compute .
Step 4′. Run logistic regression of Yu on Dj and all the selected variables in Steps 1′ and 3′ to compute .
As mentioned by a referee, it is surprising that the double selection procedure has uniform validity. The use of the additional variables selected in Step 3′, through the first-order conditions of the optimization problem, induces the necessary nearorthogonality condition. We refer to the Supplementary Material for a more detailed discussion.
Comment 3.3 (Alternative implementations, one-step correction). Another implementation for which the theory developed here applies is to replace Step 5 in Algorithm 1 with a one-step procedure. This relates to the debiasing procedure proposed in [43] to the case when the set is a singleton. In this case, instead of minimizing the criterion (3.5) in Step 5, the method makes a full Newton step from the initial estimate,
Step 5″. Compute .
The theory developed here directly apply to those estimators as well.
Comment 3.4 (Extension to other approximately sparse generalized linear models). Inspecting the proofs of Theorem 3.1 and Corollaries 3.1–3.3 reveal that these results can be extended with minor modifications to cover other approximately sparse generalized linear models. For example, the results can be extended to cover the model (3.1) where we use the probit link function instead of the logit link function Λ.
4. ℓ1-Penalized M-estimators: Nuisance functions and functional data.
In this section, we define the estimators , and , which were used in the previous section, and study their properties. We consider the same setting as that in the previous section. The results in this section rely upon a set of new results for ℓ1-penalized M-estimators with functional data presented in Appendix M of the Supplementary Material.
4.1. ℓ1-Penalized logistic regression for functional response data: Asymptotic properties.
Here, we consider the generalized linear model with the logistic link function and functional response data (3.1). As explained in the previous section, we assume that and are post-regularization maximum likelihood estimators of θu and βu corresponding to the log-likelihood function Mu(W, θ, β) =Mu(Yu, D, X, θ, ß) defined in (3.3). To define these estimators, let and be ℓl-penalized maximum likelihood (logistic regression) estimators
(4.1) |
where λ is a penalty level and a diagonal matrix of penalty loadings. We choose parameters λ and according to Algorithm 3 described below. Using the ℓ1 - penalized estimators and , we then define post-regularization estimators and by
(4.2) |
We derive the rate of convergence and sparsity properties of and as well as of and in Theorem 4.1 below. Recall that .
Algorithm 3 (Penalty level and loadings for logistic regression). Choose γ ∈ [1/n, 1/log n] and c > 1 (in practice, we set c = 1.1 and γ = 0.1/log n). Define with Nn = n. To select , choose a constant as an upper bound on the number of loops and proceed as follows: (0) Let , m = 0 and initialize for . (1) Compute and based on . (2) Set . (3) If , report the current value of and stop; otherwise set m ← m + 1 and go to step (1).
Theorem 4.1 (Rates and sparsity for functional response under logistic link). Suppose that Assumptions 3.1–3.5 hold for all . In addition, suppose that the penalty level λ and the matrices of penalty loadings are chosen according to Algorithm 3. Moreover, suppose that the following growth condition holds: . Then there exists a constant such that uniformly over all with probability 1 − o(1),
and the estimators and are uniformly sparse: . Also, uniformly overall , with probability 1 − o(1),
4.2. Lasso with estimated weights: Asymptotic properties.
Here, we consider the weighted linear model (3.2) for and . Using the parameter appearing in Assumption 3.2, it will be convenient to rewrite this model as
(4.3) |
where is an approximation error, which is asymptotically negligible under Assumption 3.2. As explained in the previous section, we assume that is a post-regularization weighted least squares estimator of (or ). To define this estimator, let be an ℓ1-penalized (weighted Lasso) estimator
(4.4) |
where λ and are the associated penalty level and the diagonal matrix of penalty loadings specified below in Algorithm 4 and where ’s are estimated weights. As in Algorithm 1 in the previous section, we set . Using , we define a post-regularized weighted least squares estimator:
(4.5) |
We derive the rate of convergence and sparsity properties of as well as of in Theorem 4.2 below.
Algorithm 4 (Penalty level and loadings for weighted Lasso). Choose γ ∈ [1/n, 1/log n] and c > 1 (in practice, we set c = 1.1 and γ = 0.1/log n). Define with . To select , choose a constant as an upper bound on the number of loops and proceed as follows: (0) Set m = 0 and . (1) Compute and based on . (2) Set . (3) If , report the current value of and stop; otherwise set m ← m + 1 and go to step (1).
Theorem 4.2 (Rates and sparsity for Lasso with estimated weights). Suppose that Assumptions 3.1–3.5 hold for all . In addition, suppose that the penalty level λ and the matrices of penalty loadings are chosen according to Algorithm 4. Moreover, suppose that the following growth condition holds: . Then there exists a constant such that uniformly over all with probability 1 − o(1),
and the estimator is uniformly sparse, . Also, uniformly over all , with probability 1 − o(1),
5. Monte Carlo simulations.
Here, we investigate the finite sample properties of the proposed estimators and the associated confidence regions. We report only the performance of the estimator based on the double selection procedure due to space constraints and note that it is very similar to the performance of the estimator based on score functions with near-orthogonality property. We will compare the proposed procedure with the traditional estimator that refits the model selected by the corresponding ℓ1-penalized M-estimator (naive post-selection estimator).
We consider a logistic regression model with functional response data where the response Yu = 1{y ≤ u} for a compact set. We specify two different designs: (1) a location model, y = x′ β0 + ξ, where ξ is distributed as a logistic random variable, the first component of x is the intercept and the other p − 1 components are distributed as N (0, Σ) with Έk,j = |0.5||k−j|; (2) a location-shift model, y = {(x′ β0 + ξ)/x′ϑ0}3, where ξ is distributed as alogistic random variable, xj = |wj| where w is a p-vector distributed as N (0, Σ) with Σk,j = |0.5||k−j|, and ϑ0 has nonnegative components. Such specification implies that for each :
Design1: θu = u(1,0,…,0)′ − β0 and Design 2: θu = u1/3ϑ0 − β0. In our simulations, we will consider n = 500 and p = 2000. For the location model (Design 1), we will consider two different choices for β0: (i) for j =1,…,p, and (ii) for j > 1 with the intercept coefficient . [These choices ensure maxj>1 |β0j | = 2 and that y is around zero in Design 2(ii).] We set . For Design 1, we have and for Design 2 we have . The results are based on 500 replications (the bootstrap procedure is performed 5000 times for each replication).
We report the (empirical) rejection frequencies for confidence regions with 95% nominal coverage, so that 0.05 is the target rejection frequency. We report the rejection frequencies for the proposed estimator and the post-naive selection estimator. Table 1 presents the performance of the methods when applied to construct a confidence interval for a single parameter ( and is a singleton). Since the setting is not symmetric, we investigate the performance for different components. Specifically, we consider {u} × {j} for j = 1,…, 5. First, consider the location model (Design 1). The difference between the performance of the naive estimator for Design 1(i) and 1(ii) highlights its fragile performance which is highly dependent on the unknown parameters. We can see from Table 1 that in Design 1(i) the Naive method achieve (pointwise) rejection frequencies up to 0.162 when the nominal level is 0.05. In Design 1(ii), it can be as high as 0.886. We also note that it is important to look at the performance of each component and avoid averaging across components (large j components are essentially not in the model, indeed for j > 50 we obtain rejection frequencies very close to 0.05 regardless of the model selection procedure). In contrast, the proposed estimator exhibits a much more robust behavior. For Design 1(i), the rejection frequencies are between 0.040 and 0.062 while for Design 1(ii) the rejection frequencies of the proposed estimator are between 0.040 and 0.056.
Table 1.
p = 2000, n = 500 | Rejection frequencies for j ∈{1,…, 5} | |||||
---|---|---|---|---|---|---|
Design | Method | j = 1 | j = 2 | j = 3 | j = 4 | j = 5 |
1(i) | Proposed | 0.042 | 0.040 | 0.062 | 0.050 | 0.044 |
Naive | 0.100 | 0.098 | 0.108 | 0.108 | 0.162 | |
1(ii) | Proposed | 0.044 | 0.040 | 0.054 | 0.056 | 0.056 |
Naive | 0.038 | 0.030 | 0.070 | 0.886 | 0.698 | |
2(i) | Proposed | 0.046 | 0.054 | 0.044 | 0.052 | 0.054 |
Naive | 0.046 | 0.050 | 0.038 | 0.070 | 0.054 | |
2(ii) | Proposed | 0.092 | 0.074 | 0.034 | 0.088 | 0.082 |
Naive | 0.034 | 0.972 | 0.182 | 0.312 | 0.916 |
Table 2 presents the performance for simultaneous confidence bands of the form where is a point estimate, is an estimate of the pointwise standard deviation and cv is a critical value that accounts for the uniform estimation. For the point estimate, we consider the proposed estimator and the post-naive selection estimator which have estimates of standard deviation. We consider two critical values: from the multiplier bootstrap (MB) procedure and the Bonferroni (BF) correction (which we expect to be conservative). For each of the four different designs [1(i), 1(ii), 2(i) and 2(ii) described above], we consider four different choices of . Table 2 displays rejection frequencies for confidence regions with 95% nominal coverage (and again 0.05 would be the ideal performance). The simulation results confirms the differences between the performance of the methods and overall the proposed procedure is closer to the nominal value of 0.05. The proposed estimator performed within a factor of two to the nominal value in 10 out of the 16 designs considered (and 13 out 16 within a factor of three). The post-naive selection estimator performed within a factor of two only in 3 out of the 16 designs when using the multiplier bootstrap as critical value (7 out of 16 within a factor of three) and similarly with the Bonferroni correction as the critical value.
Table 2.
p = 2000, n = 500 | Uniform over | ||||
---|---|---|---|---|---|
Design | Method | [1,2.5] × {1} | {1} × [10] | [1,2.5] × [10] | {1} × [1000] |
1(i) | Proposed | 0.054 | 0.036 | 0.048 | 0.040 |
Naive (MB) | 0.126 | 0.136 | 0.172 | 0.032 | |
Naive (BF) | 0.014 | 0.124 | 0.026 | 0.032 | |
1(ii) | Propose | 0.270 | 0.036 | 0.032 | 0.142 |
Naive (MB) | 0.014 | 0.802 | 0.934 | 0.404 | |
Naive (BF) | 0.000 | 0.802 | 0.718 | 0.376 | |
Design | Method | [−0.5, 0.5] × {1} | {0.5} × [10] | [−0.5,0.5] × [10] | {0.5} × [1000] |
2(i) | Proposed | 0.364 | 0.038 | 0.052 | 0.062 |
Naive (MB) | 0.116 | 0.040 | 0.022 | 0.048 | |
Naive (BF) | 0.018 | 0.038 | 0.000 | 0.046 | |
2(ii) | Proposed | 0.140 | 0.090 | 0.408 | 0.084 |
Naive (MB) | 0.002 | 0.946 | 0.996 | 0.362 | |
Naive (BF) | 0.000 | 0.946 | 0.944 | 0.298 |
Supplementary Material
APPENDIX A: PROOFS FOR SECTION 2
In this appendix, we use C to denote a strictly positive constant that is independent of n and . The value of C may change at each appearance. Also, the notation an ≲ bn means that an ≤ Cbn for all n and some C. The notation an ≳ bn means that bn ≲ an. Moreover, the notation an = o(1) means that there exists a sequence (bn)n ≥ 1 of positive numbers such that (i) |an| ≤ bn for all n, (ii) bn is independent of for all n and (iii) bn → 0 as n → ∞. Finally, the notation an = Op(bn) means that for all ϵ > 0, there exists C such that Pp(an > Cbn) ≤ 1 - ϵ for all n. Using this notation allows us to avoid repeating “uniformly over “ many times in the proofs of Theorem 2.1 and Corollaries 2.1 and 2.2. Throughout this appendix, we assume that n ≥ n0.
Proof of Theorem 2.1.
We split the proof into five steps.
Step 1. (Preliminary rate result). We claim that with probability 1 − o(1), . To show that, note that by definition of , we have for each and ,
which implies via the triangle inequality that uniformly over and , with probability 1 − o(1),
(A.1) |
where
and the bounds on I1 and I2 are derived in Step 2 [note also that ∊n = o(τn) by construction of the estimator and Assumption 2.2(vi)]. Since by Assumption 2.1(iv), does not exceed the left-hand side of (A.1), , and by Assumption 2.2(vi), B1nτn = o(1), we conclude that
(A.2) |
with probability 1 − o(1) yielding the claim of this step.
Step 2. (Bounds on I1 and I2). We claim that with probability 1 − o(1), I1 ≲ B1nτn and I2 < τn. To show these relations, observe that with probability 1 − o(1), we have I1 ≤ 2I1a + I1b and I2 ≤ I1a, where
To bound I1b, we employ Taylor’s expansion:
by Assumptions 2.1(v) and 2.2(ii).
To bound I1a, we apply the maximal inequality of Lemma P.2 to the class defined in Assumption 2.2 to conclude that with probability 1 − o(1),
(A.3) |
Here, we used: for all 0 < ∊ ≤ 1 with ‖F1‖P,q ≤ Kn by Assumption 2.2(iv); by Assumption 2.2(v); an ≥ n ∨ Kn and vn ≥ 1 by the choice of an and vn. In turn, the right-hand side of (A.3) is bounded from above by O(τn) by Assumption 2.2(vi) since and
Combining presented bounds gives the claim of this step.
Step 3. (Linearization). Here, we prove the claim of the theorem. Fix and . By definition of , we have
(A.4) |
Also, for any θ ∊ Θuj and , we have
(A.5) |
Moreover, by Taylor’s expansion of the function r ↦ EP [ψuj(W, θuj + r(θ − θuj), ηuj + r(η − ηuj))],
(A.6) |
for some . Substituting this equality into (A.5), taking and , and using (A.4) gives
(A.7) |
where
with and . It will be shown in Step 4 that
(A.8) |
In addition, it will be shown in Step 5 that
(A.9) |
Moreover, by construction of the estimator. Therefore, the expression in (A.7) is OP(δn). Also, by the near-orthogonality condition since for all and with probability 1 — o(1) by Assumption 2.2(i). Therefore, Assumption 2.1(iv) gives
The asserted claim now follows by dividing both parts of the display above by σuj (under the supremum on the left-hand side) and noting that σuj is bounded below from zero uniformly over and by Assumptions 2.2(iii) and 2.2(v).
Step 4. [Bounds on II1(u, j) and II2(u, j)]. Here, we prove (A.8). First, with probability 1 − o(1),
where the first inequality follows from Assumptions 2.1(v) and 2.2(i), the second from Step 1 and Assumptions 2.2(ii) and 2.2(vi) and the third from Assumption 2.2(vi).
Second, we have with probability 1 − o(1) that , where
for sufficiently large constant C. To bound , we applyLemmaP.2. Observe that
where we used Assumption 2.1(v) and Assumption 2.2(ii). Also, observe that (B1n τn)ω/2 ≥ n−ω/4 by Assumption 2.2(vi) since B1n ≥ 1. Therefore, an application of Lemma P.2 with an envelope F2 = 2F1 and σ = (CB1n τn)ω/2 for sufficiently large constant C gives with probability 1 − o(1),
(A.10) |
since and ‖F1‖P,q ≤ Kn by Assumption 2.2(iv) and
by Lemma O.1 because for defined in Assumption 2.2(iv). The claim of this step now follows from an application of Assumption 2.2(vi) to bound the right-hand side of (A.10).
Step 5. Here, we prove (A.9). For all and , let . Then since and Juj is bounded in absolute value below from zero uniformly over and by Assumption 2.1(iv). Therefore, for all and with probability 1 − o(1) by Assumption 2.1(i). Hence, with the same probability, for all and ,
and so it suffices to show that
(A.11) |
To prove (A.11), for given and , substitute and into (A.5) and use Taylor’s expansion in (A.6). This gives
where and are defined as II1 (u, j) and II2(u, j) in Step 3 but with replaced by . Then, given that with probability 1 − o(1), the argument in Step 4 shows that
In addition, by the definition of and by the near-orthogonality condition. Combining these bounds gives (A.11); the claim of this step follows, which completes the proof of the theorem.
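For orientation, the linearization established by the theorem is of the familiar Bahadur type: schematically, and up to the error terms controlled in Steps 1–5,
\[
\sqrt{n}\,\sigma_{uj}^{-1}\big(\hat\theta_{uj} - \theta_{uj}\big)
\;=\; -\,\sigma_{uj}^{-1} J_{uj}^{-1}\, \frac{1}{\sqrt{n}} \sum_{i=1}^n \psi_{uj}(W_i, \theta_{uj}, \eta_{uj}) \;+\; O_P(\delta_n)
\quad \text{uniformly over } u \text{ and } j.
\]
This is the representation on which the Gaussian approximation and the multiplier bootstrap results rest; the precise statement, including the sign and normalization conventions, is the one given in the theorem.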
APPENDIX B: REMAINING PROOFS FOR SECTION 2
See the Supplementary Material.
APPENDIX C: PROOFS FOR SECTIONS 3 AND 4
See the Supplementary Material.
SUPPLEMENTARY MATERIAL
Supplement to “Uniformly valid post-regularization confidence regions for many functional parameters in z-estimation framework” (DOI: 10.1214/17-AOS1671SUPP; .pdf). The supplemental material contains additional proofs omitted from the main text, a discussion of the double selection method, a set of new results for ℓ1-penalized M-estimators with functional data, additional simulation results, and an empirical application.
Footnotes
Note that the C(α)-statistic, or the orthogonal score statistic, has been explicitly used for testing (and also for setting up estimation) in high-dimensional sparse models in [11] and in [39], where it is referred to as the decorrelated score statistic. The discussion of Neyman’s construction here draws on [23]. Note also that our results cover other types of orthogonal score statistics, which allows us to handle much broader classes of models; see, for example, the discussion of conditional moment models with infinite-dimensional nuisance parameters below.
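In its classical parametric form, Neyman’s construction can be summarized in one display; the following is a textbook-style sketch (cf. [37, 38]), not a statement from this paper. With $\ell(W; \theta, \beta)$ the log-likelihood, $\theta$ the target parameter and $\beta$ the nuisance parameter, the C(α) (orthogonal) score is
\[
\psi(W; \theta, \beta) \;=\; \partial_\theta \ell(W; \theta, \beta) \;-\; \mu'\, \partial_\beta \ell(W; \theta, \beta),
\qquad
\mu' \;=\; E\big[\partial_\theta \ell\, (\partial_\beta \ell)'\big]\, \Big(E\big[\partial_\beta \ell\, (\partial_\beta \ell)'\big]\Big)^{-1},
\]
so that $\psi$ is uncorrelated with the nuisance score and, equivalently, the derivative of $E[\psi]$ with respect to local perturbations of $\beta$ vanishes at the truth. The near-orthogonality condition used in this paper is the approximate, non-likelihood analogue of this exact orthogonality.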
REFERENCES
- [1]. Andrews DWK (1994). Asymptotics for semiparametric econometric models via stochastic equicontinuity. Econometrica 62 43–72.
- [2]. Belloni A, Chen D, Chernozhukov V and Hansen C (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80 2369–2429.
- [3]. Belloni A and Chernozhukov V (2011). ℓ1-penalized quantile regression for high-dimensional sparse models. Ann. Statist. 39 82–130.
- [4]. Belloni A and Chernozhukov V (2013). Least squares after model selection in high-dimensional sparse models. Bernoulli 19 521–547. Available at arXiv:1001.0188.
- [5]. Belloni A, Chernozhukov V, Chetverikov D and Wei Y (2018). Supplement to “Uniformly valid post-regularization confidence regions for many functional parameters in z-estimation framework.” DOI: 10.1214/17-AOS1671SUPP.
- [6]. Belloni A, Chernozhukov V, Fernández-Val I and Hansen C (2013). Program evaluation with high-dimensional data. Available at arXiv:1311.2645.
- [7]. Belloni A, Chernozhukov V and Hansen C (2010). Lasso methods for Gaussian instrumental variables models. Available at arXiv:1012.1297.
- [8]. Belloni A, Chernozhukov V and Hansen C (2013). Inference for high-dimensional sparse econometric models. In Advances in Economics and Econometrics: 10th World Congress of Econometric Society, August 2010, Vol. III 245–295. Available at arXiv:1201.0220.
- [9]. Belloni A, Chernozhukov V and Hansen C (2014). Inference on treatment effects after selection among high-dimensional controls. Rev. Econ. Stud. 81 608–650.
- [10]. Belloni A, Chernozhukov V and Kato K (2013). Valid post-selection inference in high-dimensional approximately sparse quantile regression models. Available at arXiv:1312.7186.
- [11]. Belloni A, Chernozhukov V and Kato K (2015). Uniform post-selection inference for LAD regression models and other Z-estimators. Biometrika 102 77–94.
- [12]. Belloni A, Chernozhukov V and Wang L (2011). Square-root lasso: Pivotal recovery of sparse signals via conic programming. Biometrika 98 791–806.
- [13]. Belloni A, Chernozhukov V and Wang L (2014). Pivotal estimation via square-root Lasso in nonparametric regression. Ann. Statist. 42 757–788.
- [14]. Chamberlain G (1992). Efficiency bounds for semiparametric regression. Econometrica 60 567–596.
- [15]. Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W and Robins J (2018). Double/debiased machine learning for treatment and structural parameters. Econom. J. 21 C1–C68.
- [16]. Chernozhukov V, Chetverikov D and Kato K (2013). Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. Ann. Statist. 41 2786–2819.
- [17]. Chernozhukov V, Chetverikov D and Kato K (2014). Anti-concentration and honest, adaptive confidence bands. Ann. Statist. 42 1787–1818.
- [18]. Chernozhukov V, Chetverikov D and Kato K (2017). Central limit theorems and bootstrap in high dimensions. Ann. Probab. 45 2309–2352.
- [19]. Chernozhukov V, Chetverikov D and Kato K (2014). Gaussian approximation of suprema of empirical processes. Ann. Statist. 42 1564–1597.
- [20]. Chernozhukov V, Chetverikov D and Kato K (2015). Comparison and anti-concentration bounds for maxima of Gaussian random vectors. Probab. Theory Related Fields 162 47–70.
- [21]. Chernozhukov V, Chetverikov D and Kato K (2015). Empirical and multiplier bootstraps for suprema of empirical processes of increasing complexity, and related Gaussian couplings. Available at arXiv:1502.00352.
- [22]. Chernozhukov V, Fernández-Val I and Melly B (2013). Inference on counterfactual distributions. Econometrica 81 2205–2268.
- [23]. Chernozhukov V, Hansen C and Spindler M (2015). Post-selection and post-regularization inference in linear models with very many controls and instruments. Am. Econ. Rev. Pap. Proc. 105 486–490.
- [24]. Deng H and Zhang C-H (2017). Beyond Gaussian approximation: Bootstrap for maxima of sums of independent random vectors. Available at arXiv:1705.09528.
- [25]. Dudley R (1999). Uniform Central Limit Theorems. Cambridge Studies in Advanced Mathematics 63. Cambridge Univ. Press, Cambridge.
- [26]. Hothorn T, Kneib T and Bühlmann P (2014). Conditional transformation models. J. Roy. Statist. Soc. Ser. B 76 3–27.
- [27]. Javanmard A and Montanari A (2014). Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res. 15 2869–2909.
- [28]. Javanmard A and Montanari A (2014). Hypothesis testing in high-dimensional regression under the Gaussian random design model: Asymptotic theory. IEEE Trans. Inform. Theory 60 6522–6554.
- [29]. Kosorok M (2008). Introduction to Empirical Processes and Semiparametric Inference. Springer, New York.
- [30]. Leeb H and Pötscher B (2008). Can one estimate the unconditional distribution of post-model-selection estimators? Econometric Theory 24 338–376.
- [31]. Leeb H and Pötscher B (2008). Recent developments in model selection and related areas. Econometric Theory 24 319–322.
- [32]. Leeb H and Pötscher B (2008). Sparse estimators and the oracle property, or the return of Hodges’ estimator. J. Econometrics 142 201–211.
- [33]. Linton O (1996). Edgeworth approximation for MINPIN estimators in semiparametric regression models. Econometric Theory 12 30–60. Cowles Foundation Discussion Papers 1086 (1994).
- [34]. Mammen E (1993). Bootstrap and wild bootstrap for high dimensional linear models. Ann. Statist. 21 255–285.
- [35]. Newey W (1990). Semiparametric efficiency bounds. J. Appl. Econometrics 5 99–135.
- [36]. Newey W (1994). The asymptotic variance of semiparametric estimators. Econometrica 62 1349–1382.
- [37]. Neyman J (1959). Optimal asymptotic tests of composite statistical hypotheses. In Probability and Statistics: The Harald Cramér Volume (Grenander U, ed.) 213–234. Almqvist & Wiksell, Stockholm.
- [38]. Neyman J (1979). C(α) tests and their use. Sankhyā 41 1–21.
- [39]. Ning Y and Liu H (2014). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. Available at arXiv:1412.8765.
- [40]. Pötscher B and Leeb H (2009). On the distribution of penalized maximum likelihood estimators: The LASSO, SCAD, and thresholding. J. Multivariate Anal. 100 2065–2082.
- [41]. Robins J and Rotnitzky A (1995). Semiparametric efficiency in multivariate regression models with missing data. J. Amer. Statist. Assoc. 90 122–129.
- [42]. Stein C (1956). Efficient nonparametric testing and estimation. In Proc. 3rd Berkeley Symp. Math. Statist. and Probab. 1 187–195. Univ. California Press, Berkeley, CA.
- [43]. Van de Geer S, Bühlmann P, Ritov Y and Dezeure R (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. 42 1166–1202.
- [44]. Van der Vaart A (1998). Asymptotic Statistics. Cambridge Univ. Press, Cambridge.
- [45]. Van der Vaart A and Wellner J (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York.
- [46]. Zhang C-H and Zhang S (2014). Confidence intervals for low-dimensional parameters with high-dimensional data. J. Roy. Statist. Soc. Ser. B 76 217–242.
- [47]. Zhao T, Kolar M and Liu H (2014). A general framework for robust testing and confidence regions in high-dimensional quantile regression. Available at arXiv:1412.8724.