Published in final edited form as: Ann Stat. 2018 Sep 11;46(6B):3643–3675. doi: 10.1214/17-AOS1671

UNIFORMLY VALID POST-REGULARIZATION CONFIDENCE REGIONS FOR MANY FUNCTIONAL PARAMETERS IN Z-ESTIMATION FRAMEWORK

Alexandre Belloni, Victor Chernozhukov, Denis Chetverikov, Ying Wei

Abstract

In this paper, we develop procedures to construct simultaneous confidence bands for p̃ potentially infinite-dimensional parameters after model selection for general moment condition models, where p̃ is potentially much larger than the sample size of available data, n. This allows us to cover settings with functional response data where each of the p̃ parameters is a function. The procedure is based on the construction of score functions that satisfy the Neyman orthogonality condition approximately. The proposed simultaneous confidence bands rely on uniform central limit theorems for high-dimensional vectors (and not on Donsker arguments, as we allow for p̃ ≫ n). To construct the bands, we employ a multiplier bootstrap procedure, which is computationally efficient as it only involves resampling the estimated score functions (and does not require resolving the high-dimensional optimization problems). We formally apply the general theory to inference on the regression coefficient process in the distribution regression model with a logistic link, where two implementations are analyzed in detail. Simulations and an application to real data are provided to help illustrate the applicability of the results.

Keywords: Inference after model selection, moment condition models with a continuum of target parameters, Lasso and Post-Lasso with functional response data

1. Introduction.

High-dimensional models have become increasingly popular in the last two decades. Much research has been conducted on estimation of these models. However, inference about parameters in these models is much less understood, although the literature on inference is growing quickly; see the list of references below. In this paper, we construct simultaneous confidence bands for many functional parameters in a very general framework of moment condition models, where each parameter itself can be an infinite-dimensional object, and the number of such parameters can be much larger than the sample size of available data. Our paper builds upon [11], where simultaneous confidence bands have been constructed for many scalar parameters in a high-dimensional sparse z-estimation framework.

As a substantive application, we apply our general results to provide simultaneous confidence bands for parameters in a logistic regression model with functional response data

E_P[Y_u | D, X] = Λ(D′θ_u + X′β_u),  u ∈ U,   (1.1)

where D = (D_1,…,D_p̃)′ is a p̃-vector of covariates whose effects are of interest, X = (X_1,…,X_p)′ is a p-vector of controls, Λ: ℝ → (0,1) is the logistic link function, U = [0,1] is a set of indices, and for each u ∈ U, Y_u = 1{Y ≤ (1−u)y̲ + uȳ} for some constants y̲ ≤ ȳ and the response variable Y; θ_u = (θ_u1,…,θ_up̃)′ is a vector of target parameters and β_u = (β_u1,…,β_up)′ is a vector of nuisance parameters. Here, both p̃ and p are allowed to be potentially much larger than the sample size n, and we have p̃ functional target parameters (θ_uj)_{u∈U} and p functional nuisance parameters (β_uj)_{u∈U}. This example is important because it demonstrates that our methods can be used for inference about the whole distribution of the response variable Y given D and X in a high-dimensional setting, and not only about some particular features of it such as the mean or median. This model is called a distribution regression model in [22] and a conditional transformation model in [26], who argue that the model provides a rich class of models for conditional distributions, and offers a useful generalization of traditional proportional hazard models as well as a useful alternative to quantile regression. We develop inference methods to construct simultaneous confidence bands for many functional parameters of this model in Section 3.

Toward this goal, our contributions include methods to effectively estimate a continuum of high-dimensional nuisance parameters, allow for approximately sparse models, control sparse eigenvalues of a continuum of random matrices, establish an approximate linearization for a collection of “orthogonalized” (or “de-biased”) estimators, and establish the validity of a multiplier bootstrap for the construction of confidence bands for the many functional parameters of interest based on these estimators. In particular, these contributions build upon but go much beyond [11] (Corollary 4), which considers the special case of many scalar parameters in a z-estimation framework, and beyond [19] (Theorem 5.1), where simultaneous confidence bands are constructed via multiplier bootstrap for any large collection of approximately linear scalar estimators √n(θ̂_j − θ_j) = n^{-1/2}Σ_{i=1}^n ψ_{ij} + r_{nj}, j = 1,…,p, where the ℓ_∞-norm of r_n = (r_{n1},…,r_{np})′ is o_P(1/√(log p)).

Our general results refer to the problem of estimating the set of parameters (θ_uj)_{u∈U, j∈[p̃]} in the moment condition model,

E_P[ψ_uj(W, θ_uj, η_uj)] = 0,  u ∈ U, j ∈ [p̃],   (1.2)

where W is a random element that takes values in a measurable space (𝒲, 𝒜_𝒲) according to a probability measure P, U ⊂ ℝ^{d_u} and [p̃] := {1,…,p̃} are sets of indices, and for each u ∈ U and j ∈ [p̃], ψ_uj is a known score function, θ_uj is a scalar parameter of interest, and η_uj is a potentially high-dimensional (or infinite-dimensional) nuisance parameter. Assuming that a random sample of size n, (W_i)_{i=1}^n, from the distribution of W is available together with suitable estimators η̂_uj of η_uj, we aim to construct simultaneous confidence bands for (θ_uj)_{u∈U, j∈[p̃]} that are valid uniformly over a large class of probability measures P, say 𝒫_n. Specifically, for each u ∈ U and j ∈ [p̃], we construct an appropriate estimator θ̌_uj of θ_uj along with an estimator σ̂_uj of the standard deviation of √n(θ̌_uj − θ_uj), such that

P_P(θ̌_uj − c_α σ̂_uj/√n ≤ θ_uj ≤ θ̌_uj + c_α σ̂_uj/√n, for all u ∈ U and j ∈ [p̃]) → 1 − α,   (1.3)

uniformly over P ∈ 𝒫_n, where α ∈ (0, 1) and c_α is an appropriate critical value, which we choose to construct using a multiplier bootstrap method. The left- and right-hand sides of the inequalities inside the probability statement (1.3) can then be used as bounds in simultaneous confidence bands for the θ_uj’s. In this paper, we are particularly interested in the case when p̃ is potentially much larger than n and U is an uncountable subset of ℝ^{d_u}, so that for each j ∈ [p̃], (θ_uj)_{u∈U} is an infinite-dimensional (i.e., functional) parameter.

In the presence of high-dimensional nuisance parameters, the construction of valid confidence bands is delicate. Dealing with high-dimensional nuisance parameters requires regularization, which leads to a lack of asymptotic linearity of the estimators of the target parameters: regularized estimators of nuisance parameters suffer from a substantial bias, and this bias spreads into the estimators of the target parameters. This lack of asymptotic linearity in turn typically translates into severe distortions in the coverage probability of confidence bands constructed by traditional techniques that are based on perfect model selection; see [30–32, 40]. To deal with this problem, we assume that the score functions ψ_uj are constructed to satisfy a near-orthogonality condition that makes them immune to first-order changes in the value of the nuisance parameter, namely

∂_r{E_P[ψ_uj(W, θ_uj, η_uj + rη̃)]}|_{r=0} = 0,  u ∈ U, j ∈ [p̃],   (1.4)

for all η̃ in an appropriate set, where ∂_r denotes the derivative with respect to r. We shall often refer to this condition as Neyman orthogonality, since in low-dimensional parametric settings the orthogonality property originates in the work of Neyman on the C(α) test in the 1950s. In Section 2 below, we describe a few general methods for constructing score functions ψ_uj obeying the Neyman orthogonality condition.

The Neyman orthogonality condition (1.4) is important because it helps to make sure that the bias from the estimators of the high-dimensional nuisance parameters does not spread into the estimators of the target parameters. In particular, under (1.4), it follows that

E_{P,W}[ψ_uj(W, θ_uj, η̂_uj)] ≈ 0,  u ∈ U, j ∈ [p̃],

at least up to the first order, where the index W in E_{P,W}[·] means that the expectation is taken over W only. This makes the estimators of the target parameters θ_uj immune to the bias in the estimators η̂_uj, which in turn improves their statistical properties and opens up the possibility of valid inference.

As the framework (1.2) covers a broad variety of applications, it is instructive to revisit the logistic regression model with functional response data (1.1). To construct score functions ψ_uj that satisfy both the moment conditions (1.2) and the Neyman orthogonality condition (1.4) in this example, for u ∈ U and j ∈ [p̃], define a (p̃+p−1)-vector of additional nuisance parameters

γ_uj = argmin_{γ∈ℝ^{p̃+p−1}} E_P[f_u^2{D_j − X^j′γ}^2],   (1.5)

where X^j = (D′_{[p̃]∖j}, X′)′, D_{[p̃]∖j} = (D_1,…,D_{j−1},D_{j+1},…,D_p̃)′, and

f_u^2 = f_u^2(D, X) = Var_P(Y_u | D, X).   (1.6)

Then, denoting W = (Y, D′, X′)′ and splitting θ_u into θ_uj and θ_{u[p̃]∖j} = (θ_u1,…,θ_{u,j−1},θ_{u,j+1},…,θ_up̃)′, we set

ψ_uj(W, θ_uj, η_uj) = {Y_u − Λ(D_j θ_uj + X^j′β_u^j)}(D_j − X^j′γ_uj),

where η_uj = (β_u^j′, γ_uj′)′ and β_u^j = (θ′_{u[p̃]∖j}, β_u′)′. It is straightforward to see that these score functions ψ_uj satisfy the moment conditions (1.2). To see that they also satisfy the Neyman orthogonality condition (1.4), observe that

∂_β{E_P[ψ_uj(W, θ_uj, β, γ_uj)]}|_{β=β_u^j} = −E_P[f_u^2{D_j − X^j′γ_uj}(X^j)′] = 0,
∂_γ{E_P[ψ_uj(W, θ_uj, β_u^j, γ)]}|_{γ=γ_uj} = −E_P[{Y_u − Λ(D′θ_u + X′β_u)}(X^j)′] = 0,

where the first line holds by the definition of f_u^2 and γ_uj, since Var_P(Y_u | D, X) = Λ′(D′θ_u + X′β_u), and the second line holds by (1.1). Because of this orthogonality condition, we can exploit the moment conditions (1.2) to construct regular, √n-consistent estimators of θ_uj even if nonregular, regularized or post-regularized, estimators of η_uj = (β_u^j′, γ_uj′)′ are used to cope with high dimensionality. Using these regular estimators of θ_uj, we can then construct valid confidence bands (1.3).

Our general approach to constructing simultaneous confidence bands, which is developed in Section 2, can be described as follows. First, we construct the moment conditions (1.2) that satisfy the Neyman orthogonality condition (1.4), and use these moment conditions to construct estimators θ̌_uj of θ_uj for all u ∈ U and j ∈ [p̃]. Second, under appropriate regularity conditions, we establish a Bahadur representation for the θ̌_uj’s. Third, employing the Bahadur representation, we are able to derive a suitable Gaussian approximation for the distribution of the θ̌_uj’s. Importantly, the Gaussian approximation is possible even if both p̃ and the dimension of the index set U, d_u, are allowed to grow with n, and p̃ asymptotically remains much larger than n. Finally, from the Gaussian approximation, we construct simultaneous confidence bands using a multiplier bootstrap method. Here, the Gaussian and bootstrap approximations are constructed by applying the high-dimensional central limit and bootstrap theorems established in [16–21] and verifying the conditions there.

Although the regularity conditions underlying our approach can be verified for many models defined by moment conditions, for illustration purposes, we explicitly verify these conditions for the logistic regression model with functional response data (1.1) in Section 3. We also note that the regularity conditions, in particular those related to the entropy of the nuisance parameter estimators, can be substantially relaxed if we use sample splitting, so that the nuisance parameters and the parameters of interest are estimated on separate samples; see [15]. In addition, we examine the performance of the proposed procedures in a Monte Carlo simulation study and provide an example based on real data in Section 5. Moreover, in the Supplementary Material [5], we discuss the construction of simultaneous confidence bands based on a double-selection estimator. This estimator does not require explicitly constructing the score functions satisfying the Neyman orthogonality condition but is nonetheless first-order equivalent to the estimator based on such functions.

We also develop new results for ℓ_1-penalized M-estimators in Section 4 to handle functional data and criterion functions that depend on nuisance functions for which only estimates are available, building on ideas in [3, 4, 12] (for brevity, generic results are deferred to the Supplementary Material, and Section 4 only contains the results that are relevant for the logistic regression model studied in Section 3). Specifically, we develop a method to select penalty parameters for these estimators and extend the existing theory to cover functional data, achieving rates of convergence and sparsity guarantees that hold uniformly over u ∈ U. The ability to allow both for functional data and for nuisance functions is crucial in the implementation and in the theoretical analysis of the methods proposed in this paper.

Orthogonality conditions like that in (1.4) have played an important role in statistics and econometrics. In low-dimensional settings, a similar condition was used by Neyman in [37] and [38], while in semiparametric models the orthogonality conditions were used in [1, 35, 36, 41] and [33]. In high-dimensional settings, [7] and [2] were the first to use the orthogonality condition (1.4) in a linear instrumental variables model with many instruments. Related ideas have also been used in the literature to construct confidence bands in high-dimensional linear models, generalized linear models and other nonlinear models; see [6, 8–11, 13, 27, 28, 43, 46, 47] and [39], where we can interpret each procedure as implicitly or explicitly constructing and solving an approximately Neyman-orthogonal estimating equation. We contribute to this quickly growing literature by providing procedures to construct simultaneous confidence bands for many infinite-dimensional parameters identified by moment conditions.

Throughout the paper, we use standard notation from empirical process theory. In particular, we use 𝔼_n to denote the expectation with respect to the empirical measure associated with the data (W_i)_{i=1}^n, and we use 𝔾_n to denote the empirical process √n(𝔼_n − E_P). More details about the notation are given in the Supplementary Material.

2. Confidence regions for function-valued parameters based on moment conditions.

2.1. Generic construction of confidence regions.

In this section, we state our results under high-level conditions. In the next section, we will apply these results to construct simultaneous confidence bands for many infinite-dimensional parameters in the logistic regression model with functional response data.

Recall that we are interested in constructing simultaneous confidence bands for a set of target parameters (θ_uj)_{u∈U, j∈[p̃]}, where for each u ∈ U ⊂ ℝ^{d_u} and j ∈ [p̃] = {1,…,p̃}, the parameter θ_uj satisfies the moment condition (1.2), with η_uj being a potentially high-dimensional (or infinite-dimensional) nuisance parameter. Assume that θ_uj ∈ Θ_uj, a finite or infinite interval in ℝ, and that η_uj ∈ T_uj, a convex set in a normed space equipped with a norm ∥·∥_e. We allow U to be a possibly uncountable set of indices, and p̃ to be potentially large.

We assume that a random sample (W_i)_{i=1}^n from the distribution of W is available for constructing the confidence bands. We also assume that for each u ∈ U and j ∈ [p̃], the nuisance parameter η_uj can be estimated by η̂_uj using the same data (W_i)_{i=1}^n. In the next section, we discuss examples where the η̂_uj’s are based on Lasso or Post-Lasso methods (although other modern regularization and post-regularization methods can be applied). Our confidence bands will be based on estimators θ̌_uj of θ_uj that are, for each u ∈ U and j ∈ [p̃], defined as approximate ϵ_n-solutions in Θ_uj to sample analogs of the moment conditions (1.2), that is,

sup_{u∈U, j∈[p̃]} {|𝔼_n[ψ_uj(W, θ̌_uj, η̂_uj)]| − inf_{θ∈Θ_uj} |𝔼_n[ψ_uj(W, θ, η̂_uj)]|} ≤ ϵ_n,   (2.1)

where ϵ_n = o(δ_n n^{-1/2}) for all n ≥ 1 and some sequence (δ_n)_{n≥1} of positive constants converging to zero.

To motivate the construction of the confidence bands based on the estimators θ̌_uj, we first study the distributional properties of these estimators. To do so, we will employ the following regularity conditions. Let C_0 be a strictly positive (and finite) constant, and for each u ∈ U and j ∈ [p̃], let 𝒯_uj be some subset of T_uj, whose properties are specified below in the assumptions. In particular, we will choose the sets 𝒯_uj so that, on the one hand, their complexity does not grow too fast with n but, on the other hand, for each u ∈ U and j ∈ [p̃], the estimator η̂_uj takes values in 𝒯_uj with high probability. As discussed before, we rely on the following near-orthogonality condition.

DEFINITION 2.1 (Near-orthogonality condition). For each u ∈ U and j ∈ [p̃], we say that ψ_uj obeys the near-orthogonality condition with respect to 𝒯_uj ⊂ T_uj if the following conditions hold: The Gateaux derivative map

D_{u,j,r̄}[η − η_uj] := ∂_r{E_P[ψ_uj(W, θ_uj, η_uj + r(η − η_uj))]}|_{r=r̄}

exists for all r̄ ∈ [0,1) and η ∈ 𝒯_uj and (nearly) vanishes at r̄ = 0, namely,

|D_{u,j,0}[η − η_uj]| ≤ C_0 δ_n n^{-1/2},   (2.2)

for all η ∈ 𝒯_uj.

At the end of this section, we describe several methods to obtain score functions ψuj that obey the near-orthogonality condition. Together these methods cover a wide variety of applications.

Let ω and c0 be some strictly positive (and finite) constants, and let n0 ≥ 3 be some positive integer. Also, let (B1n)n≥1 and (B2n)n≥1 be some sequences of positive constants, possibly growing to infinity, where B1n ≥ 1 for all n ≥ 1. In addition, denote

S_n := E_P[sup_{u∈U, j∈[p̃]} |√n 𝔼_n[ψ_uj(W, θ_uj, η_uj)]|],  J_uj := ∂_θ{E_P[ψ_uj(W, θ, η_uj)]}|_{θ=θ_uj}.   (2.3)

The quantity S_n measures how rich the process {ψ_uj(·, θ_uj, η_uj): u ∈ U, j ∈ [p̃]} is. The quantity J_uj measures the degree of identifiability of θ_uj by the moment condition (1.2). In many applications, it is bounded in absolute value from above and away from zero. Finally, let 𝒫_n be a set of probability measures P of possible distributions of W on the measurable space (𝒲, 𝒜_𝒲).

We collect our main conditions on the score functions ψuj and the true values of the target parameters θuj in the following assumption.

Assumption 2.1 (Moment condition problem). For all n ≥ n_0, P ∈ 𝒫_n, u ∈ U, and j ∈ [p̃], the following conditions hold: (i) The true parameter value θ_uj obeys (1.2), and Θ_uj contains a ball of radius C_0 n^{-1/2} S_n log n centered at θ_uj. (ii) The map (θ, η) ↦ E_P[ψ_uj(W, θ, η)] is twice continuously Gateaux-differentiable on Θ_uj × 𝒯_uj. (iii) The score function ψ_uj obeys the near-orthogonality condition given in Definition 2.1 for the set 𝒯_uj ⊂ T_uj. (iv) For all θ ∈ Θ_uj, |E_P[ψ_uj(W, θ, η_uj)]| ≥ 2^{-1}|J_uj(θ − θ_uj)| ∧ c_0, where J_uj satisfies c_0 ≤ |J_uj| ≤ C_0. (v) For all r ∈ [0, 1), θ ∈ Θ_uj, and η ∈ 𝒯_uj:

  1. E_P[(ψ_uj(W, θ, η) − ψ_uj(W, θ_uj, η_uj))^2] ≤ C_0(|θ − θ_uj| ∨ ∥η − η_uj∥_e)^ω,

  2. |∂_r E_P[ψ_uj(W, θ, η_uj + r(η − η_uj))]| ≤ B_{1n}∥η − η_uj∥_e,

  3. |∂_r^2 E_P[ψ_uj(W, θ_uj + r(θ − θ_uj), η_uj + r(η − η_uj))]| ≤ B_{2n}(|θ − θ_uj|^2 ∨ ∥η − η_uj∥_e^2).

Assumption 2.1 is mild and standard in moment condition problems. Assumption 2.1(i) requires θ_uj to be sufficiently separated from the boundary of Θ_uj. Assumption 2.1(ii) requires that the functions (θ, η) ↦ E_P[ψ_uj(W, θ, η)] are smooth. It is a mild condition because it does not require smoothness of the functions (θ, η) ↦ ψ_uj(W, θ, η). Assumption 2.1(iii) is our key condition and is discussed above. Assumption 2.1(iv) implies sufficient identifiability of θ_uj. In particular, it implies that the equation E_P[ψ_uj(W, θ, η_uj)] = 0 has only one solution, θ = θ_uj. If this equation has multiple solutions, Assumption 2.1(iv) implies that the set Θ_uj is restricted enough so that there is only one solution in Θ_uj. Assumption 2.1(v-a) means that the functions (θ, η) ↦ ψ_uj(W, θ, η) mapping Θ_uj × 𝒯_uj into L^2(P) are Lipschitz-continuous at (θ, η) = (θ_uj, η_uj) with Lipschitz order ω/2. In most applications, we can set ω = 2. Assumptions 2.1(v-b,v-c) impose smoothness bounds on the functions (θ, η) ↦ E_P[ψ_uj(W, θ, η)].

Next, we state our conditions related to the estimators η̂_uj. Let (Δ_n)_{n≥1} and (τ_n)_{n≥1} be some sequences of positive constants converging to zero. Also, let (a_n)_{n≥1}, (v_n)_{n≥1}, and (K_n)_{n≥1} be some sequences of positive constants, possibly growing to infinity, where a_n ≥ n ∨ K_n and v_n ≥ 1 for all n ≥ 1. Finally, let q ≥ 2 be some constant.

Assumption 2.2 (Estimation of nuisance parameters). For all n ≥ n_0 and P ∈ 𝒫_n, the following conditions hold: (i) With probability at least 1 − Δ_n, we have η̂_uj ∈ 𝒯_uj for all u ∈ U and j ∈ [p̃]. (ii) For all u ∈ U, j ∈ [p̃], and η ∈ 𝒯_uj, ∥η − η_uj∥_e ≤ τ_n. (iii) For all u ∈ U and j ∈ [p̃], we have η_uj ∈ 𝒯_uj. (iv) The function class ℱ_1 = {ψ_uj(·, θ, η): u ∈ U, j ∈ [p̃], θ ∈ Θ_uj, η ∈ 𝒯_uj} is suitably measurable and its uniform entropy numbers obey

sup_Q log N(ϵ∥F_1∥_{Q,2}, ℱ_1, ∥·∥_{Q,2}) ≤ v_n log(a_n/ϵ)  for all 0 < ϵ ≤ 1,   (2.4)

where F_1 is a measurable envelope for ℱ_1 that satisfies ∥F_1∥_{P,q} ≤ K_n. (v) For all f ∈ ℱ_1, we have c_0 ≤ ∥f∥_{P,2} ≤ C_0. (vi) The complexity characteristics a_n and v_n satisfy:

  1. (v_n log a_n / n)^{1/2} ≤ C_0 τ_n,

  2. (B_{1n}τ_n + S_n n^{-1/2} log n)^{ω/2}(v_n log a_n)^{1/2} + n^{-1/2+1/q} v_n K_n log a_n ≤ C_0 δ_n,

  3. n^{1/2} B_{1n}^2 B_{2n} τ_n^2 ≤ C_0 δ_n.

Assumption 2.2 provides sufficient conditions for the estimation of the nuisance parameters (η_uj)_{u∈U, j∈[p̃]}. Assumption 2.2(i) requires that the set 𝒯_uj is large enough so that η̂_uj ∈ 𝒯_uj with high probability. Assumptions 2.2(i,ii) together require that the estimator η̂_uj converges to η_uj at rate τ_n. This rate should be fast enough so that Assumptions 2.2(vi-b,vi-c) are satisfied. Assumption 2.2(iv) gives a bound on the complexity of the set 𝒯_uj expressed via uniform entropy numbers, and Assumptions 2.2(vi-a,vi-b) require that the set 𝒯_uj is small enough so that its complexity does not grow too fast. Assumption 2.2(v) requires that the functions (θ, η) ↦ ψ_uj(W, θ, η) are scaled properly. Suitable measurability of ℱ_1, required in Assumption 2.2(iv), is a mild condition that is satisfied in most practical cases; see the Supplementary Material and [25] for clarifications. Overall, Assumption 2.2 shows the trade-off in the choice of the sets 𝒯_uj: setting 𝒯_uj large, on the one hand, makes it easy to satisfy Assumption 2.2(i) but, on the other hand, yields large values of a_n and v_n in Assumption 2.2(iv), making it difficult to satisfy Assumption 2.2(vi).

We stress that the class ℱ_1 does not need to be Donsker because its uniform entropy numbers are allowed to increase with n. This is important because allowing for non-Donsker classes is necessary to deal with high-dimensional nuisance parameters. Note also that our conditions are very different from the conditions imposed in various settings with nonparametrically estimated nuisance functions; see, for example, [44, 45] and [29].

In addition, we emphasize that the conditions stated in Assumption 2.2 are sufficient for our results for the general model (1.2) but can often be relaxed if the structure of the functions ψ_uj(W, θ, η) is known. For example, it is possible to relax Assumption 2.2(vi) if the functions ψ_uj(W, θ, η) are linear in θ, which happens in the linear regression model with θ being the coefficient on the covariate of interest; see [9]. Moreover, it is possible to relax the entropy condition (2.4) of Assumption 2.2 by relying upon sample splitting, where part of the data is used to estimate η_uj, and the other part is used to estimate θ_uj given the estimate η̂_uj of η_uj; see [2] and [15]. By swapping the roles of the two parts and averaging the resulting two estimators, we do not incur any efficiency losses.

The following theorem is our first main result in this paper.

Theorem 2.1 (Uniform Bahadur representation). Under Assumptions 2.1 and 2.2, for an estimator (θ̌_uj)_{u∈U, j∈[p̃]} that obeys (2.1), we have

√n σ_uj^{-1}(θ̌_uj − θ_uj) = 𝔾_n ψ̄_uj + O_P(δ_n)   (2.5)

in ℓ^∞(U × [p̃]) uniformly over P ∈ 𝒫_n, where ψ̄_uj(·) := −σ_uj^{-1} J_uj^{-1} ψ_uj(·, θ_uj, η_uj) and σ_uj^2 := J_uj^{-2} E_P[ψ_uj^2(W, θ_uj, η_uj)].

Comment 2.1 (On the proof of Theorem 2.1). To prove this theorem, we use the following identity:

√n E_{P,W}[ψ_uj(W, θ̌_uj, η̂_uj) − ψ_uj(W, θ_uj, η_uj)] = −√n 𝔼_n[ψ_uj(W, θ_uj, η_uj)]   (2.6)
+ √n 𝔼_n[ψ_uj(W, θ̌_uj, η̂_uj)] + 𝔾_n ψ_uj(W, θ_uj, η_uj) − 𝔾_n ψ_uj(W, θ̌_uj, η̂_uj).   (2.7)

Here, the term on the right-hand side of (2.6) is the main term on the right-hand side of (2.5), up to the normalization (σ_uj J_uj)^{-1}. Also, we show that the first term in (2.7) is O_P(δ_n) since θ̌_uj satisfies (2.1). Moreover, using a rather standard theory of Z-estimators, we show that θ̌_uj − θ_uj = O_P(B_{1n}τ_n). This in turn allows us to show, with the help of empirical process arguments, that the difference of the last two terms in (2.7) is O_P(δ_n) as well. (In [15], we also point out that this difference is O_P(δ_n) under much weaker entropy conditions than those in Assumption 2.2 if η̂_uj and θ̌_uj are obtained using separate samples.) Thus, it remains to show that the left-hand side of (2.6) is equal to the left-hand side of (2.5), up to an approximation error O_P(δ_n) and up to the normalization (σ_uj J_uj)^{-1}. To do so, we use a second-order Taylor expansion of the function

f(r) = E_{P,W}[ψ_uj(W, θ_uj + r(θ̌_uj − θ_uj), η_uj + r(η̂_uj − η_uj))]

at r = 1 around r = 0. This gives

√n E_{P,W}[ψ_uj(W, θ̌_uj, η̂_uj) − ψ_uj(W, θ_uj, η_uj)] = √n(f(1) − f(0)) = √n J_uj(θ̌_uj − θ_uj) + √n D_{u,j,0}[η̂_uj − η_uj] + √n f″(r̄)/2

for some r̄ ∈ (0,1). Here, √n f″(r̄) = O_P(δ_n) follows from Assumptions 2.1 and 2.2, and the key near-orthogonality condition also allows us to show that √n D_{u,j,0}[η̂_uj − η_uj] = O_P(δ_n). Without this condition, the term √n D_{u,j,0}[η̂_uj − η_uj] would give a first-order bias and lead to a slower-than-√n rate of convergence of the estimator θ̌_uj. Finally, again using empirical process arguments, we can show that all the bounds, including the term O_P(δ_n), hold uniformly over u ∈ U and j ∈ [p̃].

Comment 2.2 (On uniformity in u in Theorem 2.1). When the functions u ↦ √n σ_uj^{-1}(θ̌_uj − θ_uj) − 𝔾_n ψ̄_uj are Lipschitz-continuous, one can use a simple discretization argument to conclude that the approximation in (2.5) holds uniformly over (u, j) ∈ U × [p̃] as long as we can show that it holds for each (u, j) ∈ U × [p̃]. However, in many applications, including the distribution regression model discussed in Section 3, this function is actually not continuous, and the location of the jumps depends on the data. Therefore, we have to rely on a more complicated argument to establish uniformity in u in the bound (2.5).

The uniform Bahadur representation derived in Theorem 2.1 is useful for the construction of simultaneous confidence bands for (θ_uj)_{u∈U, j∈[p̃]} as in (1.3). For this purpose, we apply new high-dimensional central limit and bootstrap theorems that have been recently developed in a sequence of papers [16, 18–20] and [21]. To apply these theorems, we make use of the following regularity condition.

Let (δ̄_n)_{n≥1} be a sequence of positive constants converging to zero. Also, let (ϱ_n)_{n≥1}, (ϱ̄_n)_{n≥1}, (A_n)_{n≥1}, (Ā_n)_{n≥1}, and (L_n)_{n≥1} be some sequences of positive constants, possibly growing to infinity, where ϱ_n ≥ 1, A_n ≥ n, and Ā_n ≥ n for all n ≥ 1. In addition, from now on, we assume that q > 4. Denote by ψ̂_uj(·) := −σ̂_uj^{-1} Ĵ_uj^{-1} ψ_uj(·, θ̌_uj, η̂_uj) an estimator of ψ̄_uj(·), with Ĵ_uj and σ̂_uj being suitable estimators of J_uj and σ_uj.

Assumption 2.3 (Additional score regularity). For all n ≥ n_0 and P ∈ 𝒫_n, the following conditions hold: (i) The function class ℱ_0 = {ψ̄_uj(·): u ∈ U, j ∈ [p̃]} is suitably measurable and its uniform entropy numbers obey

sup_Q log N(ϵ∥F_0∥_{Q,2}, ℱ_0, ∥·∥_{Q,2}) ≤ ϱ_n log(A_n/ϵ)  for all 0 < ϵ ≤ 1,

where F_0 is a measurable envelope for ℱ_0 that satisfies ∥F_0∥_{P,q} ≤ L_n. (ii) For all f ∈ ℱ_0 and k = 3, 4, we have E_P[|f(W)|^k] ≤ C_0 L_n^{k−2}. (iii) The function class ℱ̂_0 = {ψ̄_uj(·) − ψ̂_uj(·): u ∈ U, j ∈ [p̃]} satisfies, with probability 1 − Δ_n: log N(ϵ, ℱ̂_0, ∥·∥_{n,2}) ≤ ϱ̄_n log(Ā_n/ϵ) for all 0 < ϵ ≤ 1 and ∥f∥_{n,2} ≤ δ̄_n for all f ∈ ℱ̂_0.

This assumption is technical, and its verification in applications is rather standard. For the Gaussian approximation result below, we actually only need the first and second parts of this assumption. The third part will be needed for establishing the validity of the simultaneous confidence bands based on the multiplier bootstrap procedure. As a side note, observe that Assumption 2.3 allows us to bound S_n, defined in (2.3) and used in Assumptions 2.1 and 2.2; see Appendix G of the Supplementary Material.

Next, let (N_uj)_{u∈U, j∈[p̃]} denote a tight zero-mean Gaussian process indexed by U × [p̃] with covariance operator given by E_P[ψ̄_uj(W)ψ̄_{u′j′}(W)] for u, u′ ∈ U and j, j′ ∈ [p̃]. We have the following corollary of Theorem 2.1, which is our second main result in this paper.

Corollary 2.1 (Gaussian approximation). Suppose that Assumptions 2.1, 2.2 and 2.3(i,ii) hold. In addition, suppose that the following growth conditions are satisfied: δ_n^2 ϱ_n log A_n = o(1), L_n^{2/7} ϱ_n log A_n = o(n^{1/7}) and L_n^{2/3} ϱ_n log A_n = o(n^{1/3−2/(3q)}). Then

sup_{t∈ℝ} |P_P(sup_{u∈U, j∈[p̃]} |√n σ_uj^{-1}(θ̌_uj − θ_uj)| ≤ t) − P_P(sup_{u∈U, j∈[p̃]} |N_uj| ≤ t)| = o(1)

uniformly over P ∈ 𝒫_n.

Based on Corollary 2.1, we are now able to construct simultaneous confidence bands for the θ_uj’s as in (1.3). In particular, we will use the Gaussian multiplier bootstrap method employing the estimates ψ̂_uj of ψ̄_uj. To describe the method, define the process

Ĝ = (Ĝ_uj)_{u∈U, j∈[p̃]} = (n^{-1/2} Σ_{i=1}^n ξ_i ψ̂_uj(W_i))_{u∈U, j∈[p̃]},   (2.8)

where (ξ_i)_{i=1}^n are independent standard normal random variables, independent of the data (W_i)_{i=1}^n. Then the multiplier bootstrap critical value c_α is defined as the (1 − α) quantile of the conditional distribution of sup_{u∈U, j∈[p̃]} |Ĝ_uj| given the data (W_i)_{i=1}^n. To prove the validity of this critical value for the construction of simultaneous confidence bands of the form (1.3), we will impose the following additional assumption. Let (ε_n)_{n≥1} be a sequence of positive constants converging to zero.
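For concreteness, the bootstrap computation can be sketched in a few lines of numpy, assuming the normalized scores ψ̂_uj(W_i) have already been computed and stacked into an n × K matrix with one column per pair (u, j) on a finite grid (the function name, the grid discretization, and the default settings below are illustrative choices, not part of the formal theory):

```python
import numpy as np

def multiplier_bootstrap_cv(psi_hat, alpha=0.05, n_boot=5000, seed=0):
    """Multiplier bootstrap critical value c_alpha based on (2.8).

    psi_hat: (n, K) array of estimated normalized scores psi_hat_uj(W_i),
             one column per pair (u, j) on a finite grid approximating U x [p~].
    Returns the (1 - alpha) quantile of sup_{u,j} |G_hat_uj| across draws.
    """
    n = psi_hat.shape[0]
    rng = np.random.default_rng(seed)
    sup_stats = np.empty(n_boot)
    for b in range(n_boot):
        xi = rng.standard_normal(n)          # Gaussian multipliers, independent of the data
        g_hat = xi @ psi_hat / np.sqrt(n)    # G_hat_uj = n^{-1/2} sum_i xi_i psi_hat_uj(W_i)
        sup_stats[b] = np.max(np.abs(g_hat))
    return np.quantile(sup_stats, 1 - alpha)
```

Given c_α from such a routine, the bands in (1.3) are assembled pointwise as θ̌_uj ± c_α σ̂_uj/√n over the grid.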

Assumption 2.4 (Variation estimation). For all n ≥ n_0 and P ∈ 𝒫_n,

P_P(sup_{u∈U, j∈[p̃]} |σ̂_uj/σ_uj − 1| > ε_n) ≤ Δ_n.

The following corollary, establishing the validity of the multiplier bootstrap critical value c_α for the construction of simultaneous confidence bands, is our third main result in this paper.

Corollary 2.2 (Simultaneous confidence bands). Suppose that Assumptions 2.1–2.4 hold. In addition, suppose that the growth conditions of Corollary 2.1 hold. Finally, suppose that ε_n √(ϱ_n log A_n) = o(1) and δ̄_n^2 ϱ̄_n ϱ_n (log Ā_n)(log A_n) = o(1). Then (1.3) holds uniformly over P ∈ 𝒫_n.

Comment 2.3 (Confidence bands based on other bootstrap schemes). Results in [24] suggest that the conditions of Corollary 2.2 can be somewhat relaxed if, instead of using the Gaussian weights in the multiplier bootstrap method, we use Mammen’s weights as in [34] or if we use the empirical bootstrap instead of the multiplier bootstrap. Since the results in [24] apply only to high-dimensional random vectors and do not apply to infinite-dimensional random processes, we leave a formal discussion of the results under these alternative bootstrap schemes to future work.

2.2. Construction of score functions satisfying the orthogonality condition.

Here, we discuss several methods for generating orthogonal scores in a wide variety of settings, including the classical Neyman construction. In what follows, since the argument applies to each u and j, it is convenient to omit the indices u and j and to use the subscript 0 to indicate the true values of the parameters. For simplicity, we also focus the discussion on the exactly orthogonal case. With these simplifications, we can restate the orthogonality condition as follows: we say that the score ψ obeys the Neyman orthogonality condition with respect to 𝒯 ⊂ T if the following conditions hold: The Gateaux derivative map

D_{r̄}[η − η_0] := ∂_r{E_P[ψ(W, θ_0, η_0 + r(η − η_0))]}|_{r=r̄}

exists for all r̄ ∈ [0,1) and η ∈ 𝒯 and vanishes at r̄ = 0, namely,

D_0[η − η_0] = 0,   (2.9)

for all η ∈ 𝒯.

(1) Orthogonal scores for likelihood problems with finite-dimensional nuisance parameters. In likelihood settings with finite-dimensional parameters, the construction of orthogonal equations was proposed by Neyman [37], who used them in the construction of his celebrated C(α)-statistic.

To describe the construction, suppose that the log-likelihood function associated to observation W is (θ, β) ↦ ℓ(W, θ, β), where θ ∈ Θ ⊂ ℝ^d is the target parameter and β ∈ T ⊂ ℝ^{p_0} is the nuisance parameter. Under regularity conditions, the true parameter values θ_0 and β_0 obey

E_P[∂_θ ℓ(W, θ_0, β_0)] = 0,  E_P[∂_β ℓ(W, θ_0, β_0)] = 0.   (2.10)

Now consider the new score function

ψ(W, θ, η) = ∂_θ ℓ(W, θ, β) − μ ∂_β ℓ(W, θ, β),   (2.11)

where the nuisance parameter is

η = (β′, vec(μ)′)′ ∈ T × 𝒟 ⊂ ℝ^p,  p = p_0 + d·p_0,

μ is the d × p_0 orthogonalization parameter matrix whose true value μ_0 solves the equation

J_{θβ} − μ J_{ββ} = 0  (i.e., μ_0 = J_{θβ} J_{ββ}^{-1}),

and

J = (J_{θθ}  J_{θβ}; J_{βθ}  J_{ββ}) = ∂_{(θ′,β′)} E_P[∂_{(θ′,β′)′} ℓ(W, θ, β)]|_{θ=θ_0; β=β_0}.

Provided that μ_0 is well defined, we have by (2.10) that E_P[ψ(W, θ_0, η_0)] = 0, where η_0 = (β_0′, vec(μ_0)′)′. Moreover, it is trivial to verify that under standard regularity conditions the score function ψ obeys the near-orthogonality condition (2.2) exactly (i.e., with C_0 = 0), that is,

∂_η E_P[ψ(W, θ_0, η)]|_{η=η_0} = 0.

Note that in this example, μ0 not only creates the necessary orthogonality but also creates the efficient score for inference on the main parameter θ, as emphasized by Neyman.
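As a numerical illustration of this construction (the two-parameter Gaussian model below is our own toy example, not one from the paper), the following Python sketch computes μ_0 = J_{θβ} J_{ββ}^{-1} from the likelihood scores and checks that the orthogonalized score (2.11) is insensitive to first-order perturbations of the nuisance parameter:

```python
import numpy as np

rng = np.random.default_rng(1)
theta0, beta0, n = 1.0, 2.0, 100_000
W1 = theta0 + beta0 + rng.standard_normal(n)   # W1 ~ N(theta0 + beta0, 1)
W2 = beta0 + rng.standard_normal(n)            # W2 ~ N(beta0, 1)

# Likelihood scores at the true parameter values:
#   d/dtheta l = W1 - theta - beta,  d/dbeta l = (W1 - theta - beta) + (W2 - beta).
s_theta = W1 - theta0 - beta0
s_beta = (W1 - theta0 - beta0) + (W2 - beta0)

# Estimate the J-blocks via the information equality J = -E[score x score'].
J_tb = -np.mean(s_theta * s_beta)    # J_{theta beta}, here -1
J_bb = -np.mean(s_beta * s_beta)     # J_{beta beta},  here -2
mu0 = J_tb / J_bb                    # mu0 = J_{theta beta} J_{beta beta}^{-1} = 1/2

def psi(theta, beta):
    """Orthogonalized score (2.11): d/dtheta l - mu0 * d/dbeta l."""
    st = W1 - theta - beta
    sb = (W1 - theta - beta) + (W2 - beta)
    return st - mu0 * sb

eps = 0.1  # first-order perturbation of the nuisance parameter
print(np.mean(psi(theta0, beta0 + eps)))      # ~ 0: orthogonal score barely moves
print(np.mean(W1 - theta0 - (beta0 + eps)))   # ~ -0.1: raw score shifts one-for-one
```

Up to simulation noise, the first printout is zero while the raw score moves one-for-one with the perturbation, which is exactly the point of the orthogonalization.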

(2) Orthogonal scores for likelihood problems with infinite-dimensional nuisance parameters. Neyman's construction can be extended to semiparametric models, where the nuisance parameter β is a function. In this case, the original score functions (θ, β) ↦ ∂_θ ℓ(W, θ, β) corresponding to the log-likelihood function (θ, β) ↦ ℓ(W, θ, β) associated to observation W can be transformed into efficient score functions ψ that obey the exact orthogonality condition (2.9) by projecting the original score functions onto the orthocomplement of the tangent space induced by the nuisance parameter β; see Chapter 25 of [44] for a detailed description of this construction. Note that the projection may create additional nuisance parameters, so that the new nuisance parameter η could be of larger dimension than β. Other relevant references include [9, 29, 45] and [11]. The approach is related to Neyman's construction in the sense that the score ψ arising in this model is actually Neyman's score arising in a one-dimensional least favorable parametric subfamily [42]; see Chapter 25 of [44] for details.

(3) Orthogonal scores for conditional moment problems with infinite-dimensional nuisance parameters. Next, consider the conditional moment restrictions framework studied by Chamberlain [14]. To define the framework, let W, D and V be random vectors in ℝ^{d_W}, ℝ^{d_D} and ℝ^{d_V}, respectively, with D and V being subvectors of W, so that d_D + d_V ≤ d_W. Also, let θ ∈ Θ ⊂ ℝ^{d_θ} be a finite-dimensional parameter whose true value θ_0 is of interest, and let h: ℝ^{d_V} → ℝ^{d_h} be a vector-valued functional nuisance parameter, with the true value being h_0: ℝ^{d_V} → ℝ^{d_h}. The conditional moment restrictions framework assumes that θ_0 and h_0 satisfy the following equation:

E_P[m(W, θ_0, h_0(V)) | D, V] = 0,   (2.12)

where m: ℝ^{d_W} × ℝ^{d_θ} × ℝ^{d_h} → ℝ^{d_m} is some known function. This framework is of interest because it covers an extremely rich variety of models, without having to rely explicitly on the likelihood formulation. For example, it covers the partial linear model

Y = Dθ_0 + h_0(V) + U,  E_P[U | D, V] = 0,   (2.13)

where Y is a scalar dependent random variable, D is a scalar treatment random variable, V is a vector of control random variables, and U is a scalar unobservable noise random variable. Indeed, (2.13) implies (2.12) by setting W = (Y, D, V′)′ and m(W, θ, h) = Y − Dθ − h.

Here, we would like to build a (generalized) score function (θ, η) ↦ ψ(W, θ, η) for estimating θ_0, the true value of the parameter θ, where η is a new nuisance parameter with true value η_0, that obeys the near-orthogonality condition (2.2). There are many ways to do so, but one particularly useful way is the following. Consider the functional parameters Σ: ℝ^{d_D+d_V} → ℝ^{d_m×d_m} and φ: ℝ^{d_D+d_V} → ℝ^{d_θ×d_m} whose true values are given by

Σ_0(D, V) = E_P[m(W, θ_0, h_0(V)) m(W, θ_0, h_0(V))′ | D, V],
φ_0(D, V) = (A_0(D, V) − Γ_0(D, V) G_0(V))′,

where

A_0(D, V) = ∂_θ E_P[m(W, θ, h_0(V)) | D, V]|_{θ=θ_0},
Γ_0(D, V) = ∂_h E_P[m(W, θ_0, h) | D, V]|_{h=h_0(V)},
G_0(V) = (E_P[Γ_0(D, V)′ Σ_0(D, V)^{-1} Γ_0(D, V) | V])^{-1} × E_P[Γ_0(D, V)′ Σ_0(D, V)^{-1} A_0(D, V) | V].

Then set η = (h, φ, Σ) and η_0 = (h_0, φ_0, Σ_0), and define the score function:

ψ(W, θ, η) = φ(D, V)·Σ(D, V)^{-1}·m(W, θ, h(V)),

where φ(D, V) plays the role of an “instrument,” Σ(D, V)^{-1} plays the role of a weight, and m(W, θ, h(V)) plays the role of a residual.

It is rather straightforward to verify that, under mild regularity conditions, the score function ψ satisfies the moment condition E_P[ψ(W, θ_0, η_0)] = 0 and, in addition, the orthogonality condition:

∂_η E_P[ψ(W, θ_0, η)]|_{η=η_0} = 0.

Note that this construction gives the efficient score function ψ that yields an estimator of θ0 achieving the semiparametric efficiency bound, as calculated by Chamberlain [14].
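To make this concrete, in the partially linear model (2.13) with constant Σ the construction above reduces (up to sign and scale) to the familiar residual-on-residual score, and a partialling-out variant of it, which is also Neyman orthogonal, is easy to implement. The sketch below is ours: the data-generating process is invented for illustration, and the fixed Lasso penalties stand in for the data-driven choices of Section 4.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p_v, theta0 = 1000, 200, 0.5
V = rng.standard_normal((n, p_v))
D = V[:, 0] - 0.5 * V[:, 1] + rng.standard_normal(n)             # E[D|V] is sparse
Y = D * theta0 + 2 * V[:, 0] + V[:, 2] + rng.standard_normal(n)  # h0(V) is sparse

# Nuisance estimates: m(V) ~ E[D|V] and ell(V) ~ E[Y|V], both via Lasso.
m_hat = Lasso(alpha=0.1).fit(V, D).predict(V)
ell_hat = Lasso(alpha=0.1).fit(V, Y).predict(V)

# Orthogonal score psi = (Y - ell(V) - theta (D - m(V))) (D - m(V));
# solving En[psi] = 0 in theta gives a residual-on-residual coefficient.
D_res, Y_res = D - m_hat, Y - ell_hat
theta_check = np.sum(D_res * Y_res) / np.sum(D_res * D_res)

# Plug-in standard error from the score and J = -En[(D - m(V))^2].
psi = (Y_res - theta_check * D_res) * D_res
se = np.sqrt(np.mean(psi ** 2) / np.mean(D_res ** 2) ** 2 / n)
print(f"theta_check = {theta_check:.3f} +/- {1.96 * se:.3f}")
```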

3. Application to logistic regression model with functional response data.

In this section, we apply our main results to a logistic regression model with functional response data described in the Introduction.

3.1. Model.

We consider a response variable Y that induces a functional response (Y_u)_{u∈U} via Y_u = 1{Y ≤ (1−u)y̲ + uȳ} for a set of indices U = [0,1] and some constants y̲ ≤ ȳ. We are interested in the dependence of this functional response on a p̃-vector of covariates D = (D_1,…,D_p̃)′ ∈ ℝ^{p̃}, controlling for a p-vector of additional covariates X = (X_1,…,X_p)′ ∈ ℝ^p. We allow both p̃ and p to be (much) larger than the sample size of available data, n.

For each u ∈ U, we assume that Y_u satisfies the generalized linear model with the logistic link function

E_P[Y_u | D, X] = Λ(D′θ_u + X′β_u) + r_u,   (3.1)

where θ_u = (θ_u1,…,θ_up̃)′ is a vector of parameters of interest, β_u = (β_u1,…,β_up)′ is a vector of nuisance parameters, r_u = r_u(D, X) is an approximation error, Λ: ℝ → (0,1) is the logistic link function defined by

Λ(t) = exp(t)/(1 + exp(t)),  t ∈ ℝ,

and P ∈ 𝒫_n is the distribution of the triple W = (Y, D′, X′)′. As in the previous section, we construct simultaneous confidence bands for the parameters (θ_uj)_{u∈U, j∈[p̃]} based on a random sample (W_i)_{i=1}^n = (Y_i, D_i′, X_i′)_{i=1}^n from the distribution of W.
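Since U = [0,1] is a continuum, any implementation works with a finite grid of indices u; a minimal numpy sketch of the construction of (Y_u)_{u∈U} (the grid size is an arbitrary illustrative choice):

```python
import numpy as np

def functional_response(y, y_lo, y_hi, grid_size=25):
    """Build Y_u = 1{y <= (1-u)*y_lo + u*y_hi} on a grid of u in [0, 1].

    y: (n,) array of responses; y_lo <= y_hi are the constants underline-y, bar-y.
    Returns the grid of u values and an (n, grid_size) 0/1 matrix of Y_u.
    """
    u_grid = np.linspace(0.0, 1.0, grid_size)
    thresholds = (1 - u_grid) * y_lo + u_grid * y_hi   # one cutoff per index u
    return u_grid, (y[:, None] <= thresholds[None, :]).astype(float)
```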

3.2. Orthogonal score functions.

Before setting up score functions that satisfy both the moment conditions (1.2) and the orthogonality condition (1.4), observe that “naive” score functions that follow directly from the model (3.1),

m_uj(W, θ, θ_{u[p̃]∖j}, β_u, r_u) = {Y_u − Λ(D_j θ + X^j′(θ′_{u[p̃]∖j}, β_u′)′) − r_u} D_j,

where X^j = (D′_{[p̃]∖j}, X′)′, satisfy the moment conditions E_P[m_uj(W, θ_uj, θ_{u[p̃]∖j}, β_u, r_u)] = 0 but violate the orthogonality condition (1.4) [with m_uj replacing ψ_uj and η_uj = (θ′_{u[p̃]∖j}, β_u′, r_u)′]. To satisfy the orthogonality condition (1.4), we proceed using the approach from Section 2.2, as in the Introduction. Specifically, for each u ∈ U and j ∈ [p̃], define the (p̃+p−1)-vector of additional nuisance parameters γ_uj by (1.5), where f_u^2 = f_u^2(D, X) is defined in (1.6). Thus, by the first-order condition of (1.5), the nuisance parameters γ_uj satisfy

f_u D_j = f_u X^j′γ_uj + v_uj,  E_P[f_u X^j v_uj] = 0.   (3.2)

Also, denote β_u^j = (θ′_{u[p̃]∖j}, β_u′)′. Then we set

ψ_uj(W, θ_uj, η_uj) = {Y_u − Λ(D_j θ_uj + X^j′β_u^j) − r_u}(D_j − X^j′γ_uj),

where the nuisance parameter is η_uj = (r_u, β_u^j′, γ_uj′). As we formally demonstrate in the proof of Theorem 3.1 below, this function satisfies the near-orthogonality condition (1.4).

3.3. Estimation using orthogonal score functions.

Next, we discuss estimation of the η_uj’s and θ_uj’s. First, we assume that the approximation error r_u = r_u(D, X) is asymptotically negligible, so that it can be estimated by the identically zero function of D and X. Second, for γ_uj, we consider an estimator γ̃_uj defined as a post-regularization weighted least squares estimator corresponding to the problem (1.5). Third, for β_u^j, we consider a plug-in estimator β̂_u^j = (θ̃′_{u[p̃]∖j}, β̃_u′)′, where θ̃_u and β̃_u are suitable estimators of θ_u and β_u. In particular, we assume that θ̃_u and β̃_u are post-regularization maximum likelihood estimators corresponding to the log-likelihood function (θ, β) ↦ −M_u(W, θ, β), where

M_u(W, θ, β) = −(1{Y_u = 1} log Λ(D′θ + X′β) + 1{Y_u = 0} log(1 − Λ(D′θ + X′β))).   (3.3)

The details of the estimators θ̃_u, β̃_u and γ̃_uj are given in Algorithm 1 below. The results in this paper can also be easily extended to the case where θ̃_u, β̃_u and γ̃_uj are replaced by the penalized maximum likelihood estimators θ̂_u and β̂_u and the penalized weighted least squares estimator γ̂_uj, respectively.

Then our estimator of η_uj is η̂_uj = (0, β̂_u^j′, γ̃_uj′). Substituting this estimator into the score function ψ_uj gives

ψ_uj(W, θ_uj, η̂_uj) = {Y_u − Λ(D_j θ_uj + X^j′β̂_u^j)}(D_j − X^j′γ̃_uj),   (3.4)

which, using the sample analog (2.1) of the moment conditions (1.2), gives the following estimator of θuj:

θ̌_uj ∈ arg inf_{θ∈Θ_uj} |𝔼_n[ψ_uj(W, θ, η̂_uj)]|.   (3.5)

The algorithm is summarized as follows.

ALGORITHM 1. For each u ∈ U and j ∈ [p̃]:

  • Step 1. Run the post-ℓ_1-penalized logistic estimator (4.2) of Y_u on D and X to compute (θ̃_u, β̃_u).

  • Step 2. Define the weights f̂_u^2 = f̂_u^2(D, X) = Λ′(D′θ̃_u + X′β̃_u).

  • Step 3. Run the Post-Lasso estimator (4.5) of f̂_u D_j on f̂_u X^j to compute γ̃_uj.

  • Step 4. Compute β̂_u^j = (θ̃′_{u[p̃]∖j}, β̃_u′)′.

  • Step 5. Solve (3.5) with ψ_uj(W, θ, η̂_uj) defined in (3.4) to compute θ̌_uj.
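The following Python sketch walks through these five steps for a single pair (u, j), using scikit-learn fits with ad hoc penalty levels in place of the data-driven choices of Algorithms 3 and 4 (all names are ours; `penalty=None` requires a recent scikit-learn, and edge cases such as empty selected supports are handled only crudely):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, LogisticRegression

def Lam(t):
    """Logistic link."""
    return 1.0 / (1.0 + np.exp(-t))

def algorithm1_one_cell(Yu, D, X, j, C_logit=0.1, alpha_lasso=0.05):
    """Sketch of Algorithm 1 for one pair (u, j). Yu: (n,) in {0,1};
    D: (n, p_tilde); X: (n, p)."""
    Z = np.hstack([D, X])

    # Step 1: l1-penalized logistic regression of Yu on (D, X), then a
    # post-selection refit on the selected support.
    l1 = LogisticRegression(penalty="l1", C=C_logit, solver="liblinear").fit(Z, Yu)
    S = np.flatnonzero(np.abs(l1.coef_.ravel()) > 1e-8)
    if S.size == 0:
        S = np.array([j])
    refit = LogisticRegression(penalty=None, solver="lbfgs", max_iter=1000).fit(Z[:, S], Yu)
    coef = np.zeros(Z.shape[1]); coef[S] = refit.coef_.ravel()
    icept = refit.intercept_[0]

    # Step 2: weights f_u^2 = Lambda'(index) = Lambda(index)(1 - Lambda(index)).
    f = np.sqrt(Lam(Z @ coef + icept) * (1 - Lam(Z @ coef + icept)))

    # Step 3: weighted Lasso of f*D_j on f*X^j, then a Post-Lasso refit.
    Xj = np.delete(Z, j, axis=1)                       # X^j: all regressors but D_j
    wl = Lasso(alpha=alpha_lasso, fit_intercept=False).fit(f[:, None] * Xj, f * Z[:, j])
    T = np.flatnonzero(np.abs(wl.coef_) > 1e-8)
    gamma = np.zeros(Xj.shape[1])
    if T.size > 0:
        gamma[T] = LinearRegression(fit_intercept=False).fit(
            f[:, None] * Xj[:, T], f * Z[:, j]).coef_

    # Steps 4-5: solve En[psi_uj(W, theta, eta_hat)] = 0 in theta by Newton steps.
    Zuj = Z[:, j] - Xj @ gamma                         # instrument D_j - X^j' gamma
    offset = Xj @ np.delete(coef, j) + icept           # X^j' beta_hat^j_u
    theta = coef[j]
    for _ in range(25):
        p = Lam(Z[:, j] * theta + offset)
        score = np.mean((Yu - p) * Zuj)                # En[psi] at current theta
        J_hat = -np.mean(p * (1 - p) * Z[:, j] * Zuj)  # derivative of En[psi] in theta
        theta -= score / J_hat
    return theta
```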

3.4. Regularity conditions.

Next, we specify our regularity conditions. For all u ∈ U and j ∈ [p̃], denote Z_uj = D_j − X^j′γ_uj. Also, denote a_n = p ∨ p̃ ∨ n. Let q, c_1 and C_1 be some strictly positive (and finite) constants, where q > 4. Moreover, let (δ_n)_{n≥1} and (Δ_n)_{n≥1} be some sequences of positive constants converging to zero. Finally, let (M_{n,1})_{n≥1} and (M_{n,2})_{n≥1} be some sequences of positive constants, possibly growing to infinity, where M_{n,1} ≥ 1 and M_{n,2} ≥ 1 for all n.

Assumption 3.1 (Parameters). For all u ∈ U, we have ∥θ_u∥ + ∥β_u∥ + max_{j∈[p̃]} ∥γ_uj∥ ≤ C_1 and max_{j∈[p̃]} sup_{θ∈Θ_uj} |θ| ≤ C_1. In addition, for all u_1, u_2 ∈ U, we have ∥θ_{u_2} − θ_{u_1}∥ + ∥β_{u_2} − β_{u_1}∥ ≤ C_1|u_2 − u_1|. Finally, for all u ∈ U and j ∈ [p̃], Θ_uj contains a ball of radius (log log n)(log a_n)^{3/2}/n^{1/2} centered at θ_uj.

Assumption 3.2 (Sparsity). There exist s = s_n and γ̄_uj, u ∈ U and j ∈ [p̃], such that for all u ∈ U, ∥β_u∥_0 + ∥θ_u∥_0 + max_{j∈[p̃]} ∥γ̄_uj∥_0 ≤ s_n and max_{j∈[p̃]}(∥γ̄_uj − γ_uj∥ + s_n^{-1/2}∥γ̄_uj − γ_uj∥_1) ≤ C_1(s_n log a_n / n)^{1/2}.

Assumption 3.3 (Distribution of Y). The conditional pdf of Y given (D, X) is bounded by C1.

Assumptions 3.1–3.3 are mild and standard in the literature. In particular, Assumption 3.1 requires the parameter spaces Θ_uj to be bounded, and also requires that, for each u ∈ U and j ∈ [p̃], the parameter θ_uj be sufficiently separated from the boundaries of the parameter space Θ_uj. Assumption 3.2 requires approximate sparsity of the model (3.1). Note that in Assumption 3.2, given that the γ̄_uj’s exist, we can and will assume without loss of generality that γ̄_uj = γ_{uj,T} for some T ⊂ {1,…,p+p̃−1} with |T| ≤ s_n, where T = T_uj is allowed to depend on u and j. Here, the (p+p̃−1)-vector γ_{uj,T} is defined from γ_uj by keeping all components of γ_uj that are in T and setting all other components to zero. Assumption 3.3 can be relaxed at the expense of more technicalities.

Assumption 3.4 (Covariates). For all u ∈ U, the following inequalities hold: (i) inf_{∥ξ∥=1} E_P[f_u^2{(D′,X′)ξ}^2] ≥ c_1, (ii) min_{j,k}(E_P[|f_u^2 Z_uj X_k^j|^2] ∧ E_P[|f_u^2 D_j X_k^j|^2]) ≥ c_1, and (iii) max_{j,k} E_P[|Z_uj X_k^j|^3]^{1/3} log^{1/2} a_n ≤ δ_n n^{1/6}. In addition, we have that (iv) sup_{∥ξ∥=1} E_P[{(D′,X′)ξ}^4] ≤ C_1, (v) M_{n,1} ≥ E_P[sup_{u∈U, j∈[p̃]} |Z_uj|^{2q}]^{1/(2q)}, (vi) M_{n,1}^2 s_n log a_n ≤ δ_n n^{1/2−1/q}, (vii) M_{n,2} ≥ {E_P[∥(D′,X′)′∥_∞^{2q}]}^{1/(2q)}, (viii) M_{n,2}^2 s_n log^{1/2} a_n ≤ δ_n n^{1/2−1/q}, and (ix) M_{n,1}^2 M_{n,2}^4 s_n ≤ δ_n n^{1−3/q}.

This assumption requires that there is no multicollinearity between the covariates in the vectors D and X. In addition, it requires that the constants y̲ and ȳ are chosen so that the probabilities of Y ≤ y̲ and of Y > ȳ are both nonvanishing, since otherwise we would have E[f_u^2] = E[Var_P(Y_u | D, X)] vanishing either for u = 0 or for u = 1, violating Assumption 3.4(i). Intuitively, sending y̲ and ȳ to the left and right tails of the distribution of Y, respectively, would blow up the variance of the estimators θ̌_uj, given by σ_uj^2 in Theorem 2.1, eventually leading to estimators with a slower-than-√n rate of convergence. Although our results could be extended to allow for the case where y̲ and ȳ are sent to the tails of the distribution of Y slowly, we skip this extension for the sake of clarity. Moreover, Assumption 3.4 imposes constraints on various moments of the covariates. Since these constraints might be difficult to grasp, at the end of this section, in Corollary 3.3, we provide an example for which these constraints simplify into easily interpretable conditions.

Assumption 3.5 (Approximation error). For all u ∈ U, we have (i) sup_{∥ξ∥=1} E_P[r_u^2{(D′,X′)ξ}^2] ≤ C_1 E_P[r_u^2], (ii) E_P[r_u^2] ≤ C_1 s_n log a_n / n, (iii) max_{j∈[p̃]} |E_P[r_u Z_uj]| ≤ δ_n n^{-1/2}, and (iv) |r_u(D, X)| ≤ f_u^2(D, X)/4 almost surely. In addition, with probability 1 − Δ_n, (v) sup_{u∈U, j∈[p̃]}(𝔼_n[(r_u Z_uj/f_u)^2] + 𝔼_n[r_u^2/f_u^6]) ≤ C_1 s_n log a_n / n.

This assumption requires the approximation error r_u = r_u(D, X) to be sufficiently small. Under Assumption 3.4, the first condition of Assumption 3.5 holds if the approximation error is such that r_u^2 ≤ C E_P[r_u^2] almost surely for some constant C.

3.5. Formal results.

Under the assumptions specified above, our estimators θ̌_uj satisfy the following uniform Bahadur representation.

Theorem 3.1 (Uniform Bahadur representation for the logistic model). Suppose that Assumptions 3.1–3.5 hold for all P ∈ 𝒫_n. In addition, suppose that the following growth condition holds: δ_n^2 log a_n = o(1). Then for the estimators θ̌_uj satisfying (3.5), we have

√n σ_uj^{-1}(θ̌_uj − θ_uj) = 𝔾_n ψ̄_uj + O_P(δ_n)   (3.6)

in ℓ^∞(U × [p̃]), uniformly over P ∈ 𝒫_n, where ψ̄_uj(W) := −σ_uj^{-1} J_uj^{-1} ψ_uj(W, θ_uj, η_uj), σ_uj^2 := E_P[J_uj^{-2} ψ_uj^2(W, θ_uj, η_uj)], and J_uj is defined in (2.3).

This theorem allows us to establish a Gaussian approximation result for the supremum of the process {√n σ_uj^{-1}(θ̌_uj − θ_uj): u ∈ U, j ∈ [p̃]}:

Corollary 3.1 (Gaussian approximation for the logistic model). Suppose that Assumptions 3.1–3.5 hold for all P ∈ 𝒫_n. In addition, suppose that the following growth conditions hold: δ_n^2 log a_n = o(1), M_{n,1}^{2/7} log a_n = o(n^{1/7}) and M_{n,1}^{2/3} log a_n = o(n^{1/3−2/(3q)}). Then

sup_{t∈ℝ} |P_P(sup_{u∈U, j∈[p̃]} |√n σ_uj^{-1}(θ̌_uj − θ_uj)| ≤ t) − P_P(sup_{u∈U, j∈[p̃]} |N_uj| ≤ t)| = o(1)

uniformly over P ∈ 𝒫_n, where (N_uj)_{u∈U, j∈[p̃]} is a tight zero-mean Gaussian process indexed by U × [p̃] with the covariance given by E_P[ψ̄_uj(W)ψ̄_{u′j′}(W)] for u, u′ ∈ U and j, j′ ∈ [p̃].

Based on this corollary, we are now able to construct simultaneous confidence bands for the parameters θuj. Observe that

J_uj = −E_P[Λ′(D_j θ_uj + X^j′β_u^j) D_j (D_j − X^j′γ_uj)],  u ∈ U, j ∈ [p̃],

and so it can be estimated by

Ĵ_uj = −𝔼_n[Λ′(D_j θ̃_uj + X^j′β̂_u^j) D_j (D_j − X^j′γ̃_uj)],  u ∈ U, j ∈ [p̃].

In addition, σ_uj^2 = E_P[J_uj^{-2} ψ_uj^2(W, θ_uj, η_uj)], and so it can be estimated by

σ̂_uj^2 = 𝔼_n[Ĵ_uj^{-2} ψ_uj^2(W, θ̃_uj, η̂_uj)],  u ∈ U, j ∈ [p̃].

Moreover, as in Section 2, define ψ̂_uj(W) = −σ̂_uj^{-1} Ĵ_uj^{-1} ψ_uj(W, θ̌_uj, η̂_uj), and let c_α be the (1 − α) quantile of the conditional distribution of sup_{u∈U, j∈[p̃]} |Ĝ_uj| given the data (W_i)_{i=1}^n, where the process Ĝ = (Ĝ_uj)_{u∈U, j∈[p̃]} is defined in (2.8). Then we have the following.

Corollary 3.2 (Simultaneous confidence bands for the logistic model). Suppose that Assumptions 3.1–3.5 hold for all P ∈ 𝒫_n. In addition, suppose that the following growth conditions hold: δ_n^2 log a_n = o(1), M_{n,1}^{2/7} log a_n = o(n^{1/7}), M_{n,1}^{2/3} log a_n = o(n^{1/3−2/(3q)}) and s_n log^3 a_n = o(n). Then (1.3) holds uniformly over P ∈ 𝒫_n.

To conclude this section, we provide an example for which the conditions of Corollary 3.2 are easy to interpret. Recall that a_n = n ∨ p ∨ p̃.

Corollary 3.3 (Uniform confidence bands for the logistic regression model under simple conditions). Suppose that Assumptions 3.1–3.3, 3.4(i,ii,iv) and 3.5(i,ii,iv,v) hold with q > 4 for all P ∈ 𝒫_n. In addition, suppose that {E_P[∥(D′,X′)′∥_∞^{2q}]}^{1/(2q)} ≤ C_1 and sup_{u∈U, j∈[p̃]} ∥γ_uj∥_1 ≤ C_1. Moreover, suppose that the following growth conditions hold: log^7 a_n / n = o(1), s_n^2 log^3 a_n / n^{1−2/q} = o(1) and sup_{u∈U, j∈[p̃]} |E_P[r_u Z_uj]| = o((n log a_n)^{-1/2}). Then (1.3) holds uniformly over P ∈ 𝒫_n.

Comment 3.1 (Estimation of variance). When constructing the confidence bands based on (1.3), we find in simulations that it is beneficial to replace the estimators σ̂_uj^2 of σ_uj^2 by max{σ̂_uj^2, Σ̂_uj^2}, where Σ̂_uj^2 = (𝔼_n[f̂_u^2(D_j − X^j′γ̃_uj)^2])^{-1} is an alternative consistent estimator of σ_uj^2.

Comment 3.2 (Alternative implementations, double selection). We note that the theory developed here applies to different estimators that construct the new score function with the desired orthogonality condition implicitly. For example, the double selection idea yields an implementation of an estimator that is first-order equivalent to the estimator based on the score function. The algorithm yielding the double selection estimator is as follows.

ALGORITHM 2. For each u ∈ U and j ∈ [p̃]:

  • Step 1′. Run the post-ℓ_1-penalized logistic estimator (4.2) of Y_u on D and X to compute (θ̃_u, β̃_u).

  • Step 2′. Define the weights f̂_u^2 = f̂_u^2(D, X) = Λ′(D′θ̃_u + X′β̃_u).

  • Step 3′. Run the Lasso estimator (4.4) of f̂_u D_j on f̂_u X^j to compute γ̂_uj.

  • Step 4′. Run the logistic regression of Y_u on D_j and all the variables selected in Steps 1′ and 3′ to compute θ̌_uj.

As mentioned by a referee, it is surprising that the double selection procedure has uniform validity. The use of the additional variables selected in Step 3′, through the first-order conditions of the optimization problem, induces the necessary near-orthogonality condition. We refer to the Supplementary Material for a more detailed discussion.
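A minimal sketch of the double-selection refit in Step 4′, assuming the index sets selected in Steps 1′ and 3′ into the columns of the stacked matrix [D, X] are available (the function and its signature are ours, and `penalty=None` requires a recent scikit-learn):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def double_selection_theta(Yu, D, X, j, support_step1, support_step3):
    """Step 4' of Algorithm 2: unpenalized logistic refit of Yu on D_j and
    the union of the variables selected in Steps 1' and 3'."""
    Z = np.hstack([D, X])
    keep = np.union1d(np.union1d(support_step1, support_step3), [j])
    fit = LogisticRegression(penalty=None, solver="lbfgs", max_iter=1000).fit(Z[:, keep], Yu)
    return fit.coef_.ravel()[np.searchsorted(keep, j)]   # coefficient on D_j
```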

Comment 3.3 (Alternative implementations, one-step correction). Another implementation to which the theory developed here applies replaces Step 5 in Algorithm 1 with a one-step procedure. This relates to the debiasing procedure proposed in [43] for the case when the set U is a singleton. In this case, instead of minimizing the criterion (3.5) in Step 5, the method makes a full Newton step from the initial estimate,

  • Step 5″. Compute θ̄_uj = θ̂_uj − Ĵ_uj^{-1} 𝔼_n[ψ_uj(W, θ̂_uj, η̂_uj)].

The theory developed here applies directly to those estimators as well.
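In code, Step 5″ is a single Newton update once the score ingredients from Steps 1–4 are in hand; a minimal numpy sketch with our own names (`offset` stands for X^j′β̂_u^j and `Zuj` for D_j − X^j′γ̃_uj, both assumed precomputed):

```python
import numpy as np

def one_step_theta(theta_init, Yu, Dj, offset, Zuj):
    """Step 5'': one full Newton step from an initial estimate theta_init."""
    p = 1.0 / (1.0 + np.exp(-(Dj * theta_init + offset)))   # Lambda(D_j theta + offset)
    score = np.mean((Yu - p) * Zuj)                          # En[psi] at theta_init
    J_hat = -np.mean(p * (1 - p) * Dj * Zuj)                 # hat J_uj
    return theta_init - score / J_hat
```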

Comment 3.4 (Extension to other approximately sparse generalized linear models). Inspecting the proofs of Theorem 3.1 and Corollaries 3.1–3.3 reveals that these results can be extended with minor modifications to cover other approximately sparse generalized linear models. For example, the results can be extended to cover the model (3.1) with the probit link function in place of the logistic link function Λ.

4. ℓ_1-Penalized M-estimators: Nuisance functions and functional data.

In this section, we define the estimators θ̃_u, β̃_u and γ̃_uj, which were used in the previous section, and study their properties. We consider the same setting as in the previous section. The results in this section rely upon a set of new results for ℓ_1-penalized M-estimators with functional data presented in Appendix M of the Supplementary Material.

4.1. ℓ_1-Penalized logistic regression for functional response data: Asymptotic properties.

Here, we consider the generalized linear model with the logistic link function and functional response data (3.1). As explained in the previous section, we assume that θ̃_u and β̃_u are post-regularization maximum likelihood estimators of θ_u and β_u corresponding to the log-likelihood function M_u(W, θ, β) = M_u(Y_u, D, X, θ, β) defined in (3.3). To define these estimators, let θ̂_u and β̂_u be ℓ_1-penalized maximum likelihood (logistic regression) estimators

(θ̂_u, β̂_u) ∈ argmin_{θ,β} (𝔼_n[M_u(Y_u, D, X, θ, β)] + (λ/n)∥Ψ̂_u(θ′, β′)′∥_1),   (4.1)

where λ is a penalty level and Ψ̂_u is a diagonal matrix of penalty loadings. We choose the parameters λ and Ψ̂_u according to Algorithm 3 described below. Using the ℓ_1-penalized estimators θ̂_u and β̂_u, we then define the post-regularization estimators θ̃_u and β̃_u by

(θ̃_u, β̃_u) ∈ argmin_{θ,β} {𝔼_n[M_u(Y_u, D, X, θ, β)]: supp(θ′, β′)′ ⊆ supp(θ̂_u′, β̂_u′)′}.   (4.2)

We derive the rate of convergence and sparsity properties of θ̃_u and β̃_u, as well as of θ̂_u and β̂_u, in Theorem 4.1 below. Recall that a_n = n ∨ p ∨ p̃.

Algorithm 3 (Penalty level and loadings for logistic regression). Choose γ ∈ [1/n, 1/log n] and c > 1 (in practice, we set c = 1.1 and γ = 0.1/log n). Define λ = c√n Φ^{-1}(1 − γ/(2(p+p̃)N_n)) with N_n = n. To select Ψ̂_u, choose a constant m̄ ≥ 0 as an upper bound on the number of loops and proceed as follows: (0) Let X̃ = (D′, X′)′, m = 0, and initialize l̂_{uk,0} = (1/2){𝔼_n[X̃_k^2]}^{1/2} for k ∈ [p+p̃]. (1) Compute (θ̂_u, β̂_u) and (θ̃_u, β̃_u) based on Ψ̂_u = diag({l̂_{uk,m}, k ∈ [p+p̃]}). (2) Set l̂_{uk,m+1} := {𝔼_n[X̃_k^2(Y_u − Λ(D′θ̃_u + X′β̃_u))^2]}^{1/2}. (3) If m ≥ m̄, report the current value of Ψ̂_u and stop; otherwise set m ← m + 1 and go to step (1).
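The penalty rule and the loadings loop of Algorithm 3 translate directly into code; a short sketch using scipy for Φ^{-1} (the function names are ours, and the two loading functions implement steps (0) and (2) of the loop):

```python
import numpy as np
from scipy.stats import norm

def penalty_level(n, p_total, N_n, gamma=None, c=1.1):
    """lambda = c sqrt(n) Phi^{-1}(1 - gamma/(2 p_total N_n)); Algorithm 3 uses N_n = n."""
    if gamma is None:
        gamma = 0.1 / np.log(n)
    return c * np.sqrt(n) * norm.ppf(1 - gamma / (2 * p_total * N_n))

def initial_loadings(Z):
    """Step (0): l_uk,0 = (1/2) {En[Z_k^2]}^{1/2} for each regressor column of Z."""
    return 0.5 * np.sqrt(np.mean(Z ** 2, axis=0))

def updated_loadings(Z, Yu, index):
    """Step (2): l_uk,m+1 = {En[Z_k^2 (Yu - Lambda(index))^2]}^{1/2}."""
    resid = Yu - 1.0 / (1.0 + np.exp(-index))     # Yu - Lambda(D'theta + X'beta)
    return np.sqrt(np.mean(Z ** 2 * resid[:, None] ** 2, axis=0))
```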

Theorem 4.1 (Rates and sparsity for functional response under the logistic link). Suppose that Assumptions 3.1–3.5 hold for all P ∈ 𝒫_n. In addition, suppose that the penalty level λ and the matrices of penalty loadings Ψ̂_u are chosen according to Algorithm 3. Moreover, suppose that the following growth condition holds: δ_n^2 log a_n = o(1). Then there exists a constant C̄ such that uniformly over all P ∈ 𝒫_n, with probability 1 − o(1),

sup_{u∈U}(∥θ̂_u − θ_u∥ + ∥β̂_u − β_u∥) ≤ C̄ √(s_n log a_n / n),  sup_{u∈U}(∥θ̂_u − θ_u∥_1 + ∥β̂_u − β_u∥_1) ≤ C̄ √(s_n^2 log a_n / n),

and the estimators θ̂_u and β̂_u are uniformly sparse: sup_{u∈U} ∥θ̂_u∥_0 + ∥β̂_u∥_0 ≤ C̄ s_n. Also, uniformly over all P ∈ 𝒫_n, with probability 1 − o(1),

sup_{u∈U}(∥θ̃_u − θ_u∥ + ∥β̃_u − β_u∥) ≤ C̄ √(s_n log a_n / n),  sup_{u∈U}(∥θ̃_u − θ_u∥_1 + ∥β̃_u − β_u∥_1) ≤ C̄ √(s_n^2 log a_n / n).

4.2. Lasso with estimated weights: Asymptotic properties.

Here, we consider the weighted linear model (3.2) for u ∈ U and j ∈ [p̃]. Using the parameter γ̄_uj appearing in Assumption 3.2, it will be convenient to rewrite this model as

f_u D_j = f_u X^j′γ̄_uj + f_u r̄_uj + v_uj,  E_P[f_u X^j v_uj] = 0,   (4.3)

where r̄_uj = X^j′(γ_uj − γ̄_uj) is an approximation error, which is asymptotically negligible under Assumption 3.2. As explained in the previous section, we assume that γ̃_uj is a post-regularization weighted least squares estimator of γ_uj (or γ̄_uj). To define this estimator, let γ̂_uj be an ℓ_1-penalized (weighted Lasso) estimator

γ̂_uj ∈ argmin_γ ((1/2)𝔼_n[f̂_u^2(D_j − X^j′γ)^2] + (λ/n)∥Ψ̂_uj γ∥_1),   (4.4)

where λ and Ψ̂_uj are the associated penalty level and diagonal matrix of penalty loadings specified below in Algorithm 4, and where the f̂_u^2’s are estimated weights. As in Algorithm 1 in the previous section, we set f̂_u^2 = f̂_u^2(D, X) = Λ′(D′θ̃_u + X′β̃_u). Using γ̂_uj, we define the post-regularized weighted least squares estimator:

γ̃_uj ∈ argmin_γ {(1/2)𝔼_n[f̂_u^2(D_j − X^j′γ)^2]: supp(γ) ⊆ supp(γ̂_uj)}.   (4.5)

We derive the rate of convergence and sparsity properties of γ̃_uj, as well as of γ̂_uj, in Theorem 4.2 below.

Algorithm 4 (Penalty level and loadings for the weighted Lasso). Choose γ ∈ [1/n, 1/log n] and c > 1 (in practice, we set c = 1.1 and γ = 0.1/log n). Define λ = c√n Φ^{-1}(1 − γ/(2(p+p̃)N_n)) with N_n = p ∨ p̃ ∨ 2n^2. To select Ψ̂_uj, choose a constant m̄ ≥ 1 as an upper bound on the number of loops and proceed as follows: (0) Set m = 0 and l̂_{ujk,0} = max_{1≤i≤n} |f̂_{u,i} X^j_{ik}|{𝔼_n[f̂_u^2 D_j^2]}^{1/2}. (1) Compute γ̂_uj and γ̃_uj based on Ψ̂_uj = diag({l̂_{ujk,m}, k ∈ [p+p̃−1]}). (2) Set l̂_{ujk,m+1} := {𝔼_n[f̂_u^4(D_j − X^j′γ̃_uj)^2(X_k^j)^2]}^{1/2}. (3) If m ≥ m̄, report the current value of Ψ̂_uj and stop; otherwise set m ← m + 1 and go to step (1).

Theorem 4.2 (Rates and sparsity for Lasso with estimated weights). Suppose that Assumptions 3.1–3.5 hold for all P ∈ 𝒫_n. In addition, suppose that the penalty level λ and the matrices of penalty loadings Ψ̂_uj are chosen according to Algorithm 4. Moreover, suppose that the following growth condition holds: δ_n^2 log a_n = o(1). Then there exists a constant C̄ such that uniformly over all P ∈ 𝒫_n, with probability 1 − o(1),

max_{j∈[p̃]} sup_{u∈U} ∥γ̂_uj − γ̄_uj∥ ≤ C̄ √(s_n log a_n / n),  max_{j∈[p̃]} sup_{u∈U} ∥γ̂_uj − γ̄_uj∥_1 ≤ C̄ √(s_n^2 log a_n / n),

and the estimator γ̂_uj is uniformly sparse: max_{j∈[p̃]} sup_{u∈U} ∥γ̂_uj∥_0 ≤ C̄ s_n. Also, uniformly over all P ∈ 𝒫_n, with probability 1 − o(1),

max_{j∈[p̃]} sup_{u∈U} ∥γ̃_uj − γ̄_uj∥ ≤ C̄ √(s_n log a_n / n),  max_{j∈[p̃]} sup_{u∈U} ∥γ̃_uj − γ̄_uj∥_1 ≤ C̄ √(s_n^2 log a_n / n).

5. Monte Carlo simulations.

Here, we investigate the finite-sample properties of the proposed estimators and the associated confidence regions. Due to space constraints, we report only the performance of the estimator based on the double selection procedure, and note that it is very similar to the performance of the estimator based on score functions with the near-orthogonality property. We compare the proposed procedure with the traditional estimator that refits the model selected by the corresponding ℓ_1-penalized M-estimator (the naive post-selection estimator).

We consider a logistic regression model with functional response data where the response is Y_u = 1{y ≤ u} for u ∈ U, a compact set. We specify two different designs: (1) a location model, y = x′β_0 + ξ, where ξ is distributed as a logistic random variable, the first component of x is the intercept, and the other p − 1 components are distributed as N(0, Σ) with Σ_{k,j} = 0.5^{|k−j|}; (2) a location-shift model, y = {(x′β_0 + ξ)/(x′ϑ_0)}^3, where ξ is distributed as a logistic random variable, x_j = |w_j| where w is a p-vector distributed as N(0, Σ) with Σ_{k,j} = 0.5^{|k−j|}, and ϑ_0 has nonnegative components. This specification implies that for each u ∈ U:

Design 1: θ_u = u(1,0,…,0)′ − β_0, and Design 2: θ_u = u^{1/3}ϑ_0 − β_0. In our simulations, we consider n = 500 and p = 2000. For the location model (Design 1), we consider two different choices for β_0: (i) β_{0j}^{(i)} = 2/j^2 for j = 1,…,p, and (ii) β_{0j}^{(ii)} = (1/2)/{j − 3.5}^2 for j > 1 with the intercept coefficient β_{01}^{(ii)} = 10. [These choices ensure max_{j>1} |β_{0j}| = 2 and that y is around zero in Design 2(ii).] We set ϑ_0 = (1/8)(1,1,1,1,0,0,…,0,0,1,1,1,1)′. For Design 1, we have U = [1, 2.5], and for Design 2, we have U = [−0.5, 0.5]. The results are based on 500 replications (the bootstrap procedure is performed 5000 times for each replication).
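For reference, a numpy sketch of the Design 1(i) data-generating process as specified above (the function name and the seed are illustrative choices):

```python
import numpy as np

def design_1i(n=500, p=2000, seed=0):
    """Design 1(i): y = x'beta0 + xi with logistic xi, intercept in x,
    remaining covariates N(0, Sigma) with Sigma_{k,j} = 0.5^{|k-j|},
    and beta0_j = 2/j^2."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p - 1)
    Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    x = np.hstack([np.ones((n, 1)),
                   rng.multivariate_normal(np.zeros(p - 1), Sigma, size=n)])
    beta0 = 2.0 / np.arange(1, p + 1) ** 2
    y = x @ beta0 + rng.logistic(size=n)
    return x, y   # then Y_u = 1{y <= u} for u in U = [1, 2.5]
```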

We report the (empirical) rejection frequencies for confidence regions with 95% nominal coverage, so that 0.05 is the target rejection frequency. We report the rejection frequencies for the proposed estimator and for the naive post-selection estimator. Table 1 presents the performance of the methods when applied to construct a confidence interval for a single parameter (p̃ = 1 and U a singleton). Since the setting is not symmetric, we investigate the performance for different components. Specifically, we consider {u} × {j} for j = 1,…,5. First, consider the location model (Design 1). The difference between the performance of the naive estimator in Designs 1(i) and 1(ii) highlights its fragile behavior, which is highly dependent on the unknown parameters. We can see from Table 1 that in Design 1(i) the naive method achieves (pointwise) rejection frequencies of up to 0.162 when the nominal level is 0.05. In Design 1(ii), they can be as high as 0.886. We also note that it is important to look at the performance of each component and to avoid averaging across components (large-j components are essentially not in the model; indeed, for j > 50 we obtain rejection frequencies very close to 0.05 regardless of the model selection procedure). In contrast, the proposed estimator exhibits much more robust behavior. For Design 1(i), the rejection frequencies are between 0.040 and 0.062, while for Design 1(ii) the rejection frequencies of the proposed estimator are between 0.040 and 0.056.

Table 1.

We report the rejection frequencies of each method for pointwise confidence intervals for each $j\in\{1,\dots,5\}$. For Design 1, we used $\mathcal U=\{1\}$ and for Design 2, we used $\mathcal U=\{0.5\}$. The results are based on 500 replications

p = 2000, n = 500. Rejection frequencies for j ∈ {1,…,5}

Design   Method     j = 1    j = 2    j = 3    j = 4    j = 5
1(i)     Proposed   0.042    0.040    0.062    0.050    0.044
         Naive      0.100    0.098    0.108    0.108    0.162
1(ii)    Proposed   0.044    0.040    0.054    0.056    0.056
         Naive      0.038    0.030    0.070    0.886    0.698
2(i)     Proposed   0.046    0.054    0.044    0.052    0.054
         Naive      0.046    0.050    0.038    0.070    0.054
2(ii)    Proposed   0.092    0.074    0.034    0.088    0.082
         Naive      0.034    0.972    0.182    0.312    0.916

Table 2 presents the performance of simultaneous confidence bands of the form $\{[\tilde\theta_{uj}-\mathrm{cv}\,\tilde\sigma_{uj},\ \tilde\theta_{uj}+\mathrm{cv}\,\tilde\sigma_{uj}]:(u,j)\in\mathcal U\times[\tilde p]\}$, where $\tilde\theta_{uj}$ is a point estimate, $\tilde\sigma_{uj}$ is an estimate of the pointwise standard deviation and cv is a critical value that accounts for simultaneous coverage over $\mathcal U\times[\tilde p]$. For the point estimate, we consider the proposed estimator and the naive post-selection estimator, each paired with its corresponding standard deviation estimates. We consider two critical values: one from the multiplier bootstrap (MB) procedure and one from the Bonferroni (BF) correction (which we expect to be conservative). For each of the four designs [1(i), 1(ii), 2(i) and 2(ii) described above], we consider four different choices of $\mathcal U\times[\tilde p]$. Table 2 displays rejection frequencies for confidence regions with 95% nominal coverage (so again 0.05 is the ideal performance). The simulation results confirm the differences between the methods: overall, the proposed procedure is closer to the nominal value of 0.05. The proposed estimator performed within a factor of two of the nominal value in 10 out of the 16 designs considered (and within a factor of three in 13 out of 16). The naive post-selection estimator performed within a factor of two in only 3 out of the 16 designs when using the multiplier bootstrap critical value (within a factor of three in 7 out of 16), and similarly with the Bonferroni correction as the critical value.
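A minimal sketch of the multiplier bootstrap critical value cv follows. As described above, only the estimated score functions are resampled, so no high-dimensional optimization is re-solved. Here `scores` is a hypothetical $n\times K$ array stacking the estimated scores over a grid of $(u,j)$ pairs, and `sigma` holds the corresponding standard deviation estimates; the exact studentization must match the definition of $\tilde\sigma_{uj}$ in the text.

```python
# Sketch of the multiplier bootstrap: perturb the estimated scores with
# i.i.d. Gaussian multipliers and take the (1 - alpha)-quantile of the
# maximal studentized statistic over the grid of (u, j) pairs.
import numpy as np

def multiplier_bootstrap_cv(scores, sigma, n_boot=5000, alpha=0.05, seed=0):
    n, _ = scores.shape
    rng = np.random.default_rng(seed)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        g = rng.standard_normal(n)           # Gaussian multipliers
        # max over (u, j) of |re-randomized, studentized score sums|
        stats[b] = np.max(np.abs(g @ scores) / (np.sqrt(n) * sigma))
    return np.quantile(stats, 1 - alpha)
```

A Bonferroni alternative would instead replace cv with the $1-\alpha/(2K)$ standard normal quantile, which is typically conservative, consistent with the BF columns of Table 2.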

Table 2.

We report the rejection frequencies of each method for the (uniform) confidence bands over $\mathcal U\times[\tilde p]$. The proposed estimator computes the critical value with the multiplier bootstrap procedure. For the naive post-selection estimator, we report results for two choices of critical value, one based on the multiplier bootstrap (MB) and one based on the Bonferroni (BF) correction. The results are based on 500 replications

p = 2000, n = 500. Uniform over $\mathcal U\times[\tilde p]$

Design   Method       [1, 2.5] × {1}   {1} × [10]   [1, 2.5] × [10]   {1} × [1000]
1(i)     Proposed     0.054            0.036        0.048             0.040
         Naive (MB)   0.126            0.136        0.172             0.032
         Naive (BF)   0.014            0.124        0.026             0.032
1(ii)    Proposed     0.270            0.036        0.032             0.142
         Naive (MB)   0.014            0.802        0.934             0.404
         Naive (BF)   0.000            0.802        0.718             0.376

Design   Method       [−0.5, 0.5] × {1}   {0.5} × [10]   [−0.5, 0.5] × [10]   {0.5} × [1000]
2(i)     Proposed     0.364               0.038          0.052                0.062
         Naive (MB)   0.116               0.040          0.022                0.048
         Naive (BF)   0.018               0.038          0.000                0.046
2(ii)    Proposed     0.140               0.090          0.408                0.084
         Naive (MB)   0.002               0.946          0.996                0.362
         Naive (BF)   0.000               0.946          0.944                0.298


APPENDIX A: PROOFS FOR SECTION 2

In this appendix, we use $C$ to denote a strictly positive constant that is independent of $n$ and $P\in\mathcal P_n$. The value of $C$ may change at each appearance. The notation $a_n\lesssim b_n$ means that $a_n\le Cb_n$ for all $n$ and some such $C$, and the notation $a_n\gtrsim b_n$ means that $b_n\lesssim a_n$. Moreover, the notation $a_n=o(1)$ means that there exists a sequence $(b_n)_{n\ge1}$ of positive numbers such that (i) $|a_n|\le b_n$ for all $n$, (ii) $b_n$ is independent of $P\in\mathcal P_n$ for all $n$ and (iii) $b_n\to0$ as $n\to\infty$. Finally, the notation $a_n=O_P(b_n)$ means that for all $\epsilon>0$, there exists $C$ such that $P_P(a_n\le Cb_n)\ge1-\epsilon$ for all $n$. Using this notation allows us to avoid repeating “uniformly over $P\in\mathcal P_n$” many times in the proofs of Theorem 2.1 and Corollaries 2.1 and 2.2. Throughout this appendix, we assume that $n\ge n_0$.

Proof of Theorem 2.1.

We split the proof into five steps.

Step 1. (Preliminary rate result). We claim that with probability $1-o(1)$, $\sup_{u\in\mathcal U,j\in[\tilde p]}|\check\theta_{uj}-\theta_{uj}|\lesssim B_{1n}\tau_n$. To show this, note that by definition of $\check\theta_{uj}$, we have for each $u\in\mathcal U$ and $j\in[\tilde p]$,

$$|\mathbb{E}_n[\psi_{uj}(W,\check\theta_{uj},\hat\eta_{uj})]|\le\inf_{\theta\in\Theta_{uj}}|\mathbb{E}_n[\psi_{uj}(W,\theta,\hat\eta_{uj})]|+\epsilon_n,$$

which implies via the triangle inequality that uniformly over $u\in\mathcal U$ and $j\in[\tilde p]$, with probability $1-o(1)$,

$$\big|E_P[\psi_{uj}(W,\theta,\eta_{uj})]|_{\theta=\check\theta_{uj}}\big|\le\epsilon_n+2I_1+2I_2\lesssim B_{1n}\tau_n,\tag{A.1}$$

where

$$I_1:=\sup_{u\in\mathcal U,j\in[\tilde p],\theta\in\Theta_{uj}}\big|\mathbb{E}_n[\psi_{uj}(W,\theta,\hat\eta_{uj})]-\mathbb{E}_n[\psi_{uj}(W,\theta,\eta_{uj})]\big|\lesssim B_{1n}\tau_n,$$
$$I_2:=\sup_{u\in\mathcal U,j\in[\tilde p],\theta\in\Theta_{uj}}\big|\mathbb{E}_n[\psi_{uj}(W,\theta,\eta_{uj})]-E_P[\psi_{uj}(W,\theta,\eta_{uj})]\big|\le\tau_n,$$

and the bounds on $I_1$ and $I_2$ are derived in Step 2 [note also that $\epsilon_n=o(\tau_n)$ by construction of the estimator and Assumption 2.2(vi)]. Since by Assumption 2.1(iv), $2^{-1}(|J_{uj}(\check\theta_{uj}-\theta_{uj})|\wedge c_0)$ does not exceed the left-hand side of (A.1), $\inf_{u\in\mathcal U,j\in[\tilde p]}|J_{uj}|\gtrsim1$, and by Assumption 2.2(vi), $B_{1n}\tau_n=o(1)$, we conclude that

$$\sup_{u\in\mathcal U,j\in[\tilde p]}|\check\theta_{uj}-\theta_{uj}|\lesssim\Big(\inf_{u\in\mathcal U,j\in[\tilde p]}|J_{uj}|\Big)^{-1}B_{1n}\tau_n\lesssim B_{1n}\tau_n\tag{A.2}$$

with probability $1-o(1)$, yielding the claim of this step.

Step 2. (Bounds on $I_1$ and $I_2$). We claim that with probability $1-o(1)$, $I_1\lesssim B_{1n}\tau_n$ and $I_2\le\tau_n$. To show these relations, observe that with probability $1-o(1)$, we have $I_1\le2I_{1a}+I_{1b}$ and $I_2\le I_{1a}$, where

$$I_{1a}:=\sup_{u\in\mathcal U,j\in[\tilde p],\theta\in\Theta_{uj},\eta\in T_{uj}}\big|\mathbb{E}_n[\psi_{uj}(W,\theta,\eta)]-E_P[\psi_{uj}(W,\theta,\eta)]\big|,$$
$$I_{1b}:=\sup_{u\in\mathcal U,j\in[\tilde p],\theta\in\Theta_{uj},\eta\in T_{uj}}\big|E_P[\psi_{uj}(W,\theta,\eta)]-E_P[\psi_{uj}(W,\theta,\eta_{uj})]\big|.$$

To bound $I_{1b}$, we employ Taylor's expansion:

$$I_{1b}\le\sup_{u\in\mathcal U,j\in[\tilde p],\theta\in\Theta_{uj},\eta\in T_{uj},r\in[0,1)}\big|\partial_rE_P[\psi_{uj}(W,\theta,\eta_{uj}+r(\eta-\eta_{uj}))]\big|\le B_{1n}\sup_{u\in\mathcal U,j\in[\tilde p],\eta\in T_{uj}}\|\eta-\eta_{uj}\|_e\le B_{1n}\tau_n,$$

by Assumptions 2.1(v) and 2.2(ii).

To bound $I_{1a}$, we apply the maximal inequality of Lemma P.2 to the class $\mathcal F_1$ defined in Assumption 2.2 to conclude that with probability $1-o(1)$,

$$I_{1a}\lesssim n^{-1/2}\big(\sqrt{v_n\log a_n}+n^{-1/2+1/q}v_nK_n\log a_n\big).\tag{A.3}$$

Here, we used: $\log\sup_QN(\epsilon\|F_1\|_{Q,2},\mathcal F_1,\|\cdot\|_{Q,2})\lesssim v_n\log(a_n/\epsilon)$ for all $0<\epsilon\le1$ with $\|F_1\|_{P,q}\le K_n$ by Assumption 2.2(iv); $\sup_{f\in\mathcal F_1}\|f\|_{P,2}^2\le C_0$ by Assumption 2.2(v); $a_n\ge n\vee K_n$ and $v_n\ge1$ by the choice of $a_n$ and $v_n$. In turn, the right-hand side of (A.3) is bounded from above by $O(\tau_n)$ by Assumption 2.2(vi) since $(v_n\log a_n/n)^{1/2}\le\tau_n$ and

$$n^{-1/2}\cdot n^{-1/2+1/q}v_nK_n\log a_n\lesssim n^{-1/2}\delta_n\le n^{-1/2}\tau_n.$$

Combining the preceding bounds gives the claim of this step.

Step 3. (Linearization). Here, we prove the claim of the theorem. Fix $u\in\mathcal U$ and $j\in[\tilde p]$. By definition of $\check\theta_{uj}$, we have

$$\sqrt n\,|\mathbb{E}_n[\psi_{uj}(W,\check\theta_{uj},\hat\eta_{uj})]|\le\inf_{\theta\in\Theta_{uj}}\sqrt n\,|\mathbb{E}_n[\psi_{uj}(W,\theta,\hat\eta_{uj})]|+\epsilon_n\sqrt n.\tag{A.4}$$

Also, for any $\theta\in\Theta_{uj}$ and $\eta\in T_{uj}$, we have

$$\sqrt n\,\mathbb{E}_n[\psi_{uj}(W,\theta,\eta)]=\sqrt n\,\mathbb{E}_n[\psi_{uj}(W,\theta_{uj},\eta_{uj})]-\mathbb{G}_n\psi_{uj}(W,\theta_{uj},\eta_{uj})-\sqrt n\big(E_P[\psi_{uj}(W,\theta_{uj},\eta_{uj})]-E_P[\psi_{uj}(W,\theta,\eta)]\big)+\mathbb{G}_n\psi_{uj}(W,\theta,\eta).\tag{A.5}$$

Moreover, by Taylor's expansion of the function $r\mapsto E_P[\psi_{uj}(W,\theta_{uj}+r(\theta-\theta_{uj}),\eta_{uj}+r(\eta-\eta_{uj}))]$,

$$E_P[\psi_{uj}(W,\theta,\eta)]-E_P[\psi_{uj}(W,\theta_{uj},\eta_{uj})]=J_{uj}(\theta-\theta_{uj})+D_{u,j,0}[\eta-\eta_{uj}]+2^{-1}\partial_r^2E_P[\psi_{uj}(W,\theta_{uj}+r(\theta-\theta_{uj}),\eta_{uj}+r(\eta-\eta_{uj}))]\big|_{r=\bar r}\tag{A.6}$$

for some $\bar r\in(0,1)$. Substituting this equality into (A.5), taking $\theta=\check\theta_{uj}$ and $\eta=\hat\eta_{uj}$, and using (A.4) gives

$$\sqrt n\,\big|\mathbb{E}_n[\psi_{uj}(W,\theta_{uj},\eta_{uj})]+J_{uj}(\check\theta_{uj}-\theta_{uj})+D_{u,j,0}[\hat\eta_{uj}-\eta_{uj}]\big|\le\epsilon_n\sqrt n+\inf_{\theta\in\Theta_{uj}}\sqrt n\,|\mathbb{E}_n[\psi_{uj}(W,\theta,\hat\eta_{uj})]|+|II_1(u,j)|+|II_2(u,j)|,\tag{A.7}$$

where

$$II_1(u,j):=\sqrt n\sup_{r\in[0,1)}\big|\partial_r^2E_P[\psi_{uj}(W,\theta_{uj}+r(\theta-\theta_{uj}),\eta_{uj}+r(\eta-\eta_{uj}))]\big|,$$
$$II_2(u,j):=\mathbb{G}_n\big(\psi_{uj}(W,\theta,\eta)-\psi_{uj}(W,\theta_{uj},\eta_{uj})\big)$$

with $\theta=\check\theta_{uj}$ and $\eta=\hat\eta_{uj}$. It will be shown in Step 4 that

$$\sup_{u\in\mathcal U,j\in[\tilde p]}\big(|II_1(u,j)|+|II_2(u,j)|\big)=O_P(\delta_n).\tag{A.8}$$

In addition, it will be shown in Step 5 that

$$\sup_{u\in\mathcal U,j\in[\tilde p]}\inf_{\theta\in\Theta_{uj}}\sqrt n\,|\mathbb{E}_n[\psi_{uj}(W,\theta,\hat\eta_{uj})]|=O_P(\delta_n).\tag{A.9}$$

Moreover, $\epsilon_n\sqrt n=o(\delta_n)$ by construction of the estimator. Therefore, the expression in (A.7) is $O_P(\delta_n)$. Also, $\sup_{u\in\mathcal U,j\in[\tilde p]}|D_{u,j,0}[\hat\eta_{uj}-\eta_{uj}]|=O_P(\delta_nn^{-1/2})$ by the near-orthogonality condition since $\hat\eta_{uj}\in T_{uj}$ for all $u\in\mathcal U$ and $j\in[\tilde p]$ with probability $1-o(1)$ by Assumption 2.2(i). Therefore, Assumption 2.1(iv) gives

$$\sup_{u\in\mathcal U,j\in[\tilde p]}\big|J_{uj}^{-1}\sqrt n\,\mathbb{E}_n[\psi_{uj}(W,\theta_{uj},\eta_{uj})]+\sqrt n(\check\theta_{uj}-\theta_{uj})\big|=O_P(\delta_n).$$

The asserted claim now follows by dividing both sides of the display above by $\sigma_{uj}$ (inside the supremum on the left-hand side) and noting that $\sigma_{uj}$ is bounded away from zero uniformly over $u\in\mathcal U$ and $j\in[\tilde p]$ by Assumptions 2.2(iii) and 2.2(v).

Step 4. [Bounds on $II_1(u,j)$ and $II_2(u,j)$]. Here, we prove (A.8). First, with probability $1-o(1)$,

$$\sup_{u\in\mathcal U,j\in[\tilde p]}|II_1(u,j)|\le\sqrt n\,B_{2n}\sup_{u\in\mathcal U,j\in[\tilde p]}\big(|\check\theta_{uj}-\theta_{uj}|^2\vee\|\hat\eta_{uj}-\eta_{uj}\|_e^2\big)\lesssim\sqrt n\,B_{1n}^2B_{2n}\tau_n^2\le\delta_n,$$

where the first inequality follows from Assumptions 2.1(v) and 2.2(i), the second from Step 1 and Assumptions 2.2(ii) and 2.2(vi), and the third from Assumption 2.2(vi).

Second, we have with probability $1-o(1)$ that $\sup_{u\in\mathcal U,j\in[\tilde p]}|II_2(u,j)|\le\sup_{f\in\mathcal F_2}|\mathbb{G}_n(f)|$, where

$$\mathcal F_2=\big\{\psi_{uj}(\cdot,\theta,\eta)-\psi_{uj}(\cdot,\theta_{uj},\eta_{uj}):u\in\mathcal U,\,j\in[\tilde p],\,\eta\in T_{uj},\,|\theta-\theta_{uj}|\le CB_{1n}\tau_n\big\}$$

for a sufficiently large constant $C$. To bound $\sup_{f\in\mathcal F_2}|\mathbb{G}_n(f)|$, we apply Lemma P.2. Observe that

$$\sup_{f\in\mathcal F_2}\|f\|_{P,2}^2\le\sup_{u,j,|\theta-\theta_{uj}|\le CB_{1n}\tau_n,\eta\in T_{uj}}E_P\big[(\psi_{uj}(W,\theta,\eta)-\psi_{uj}(W,\theta_{uj},\eta_{uj}))^2\big]\le\sup_{u,j,|\theta-\theta_{uj}|\le CB_{1n}\tau_n,\eta\in T_{uj}}C_0\big(|\theta-\theta_{uj}|\vee\|\eta-\eta_{uj}\|_e\big)^\omega\lesssim(B_{1n}\tau_n)^\omega,$$

where we used Assumptions 2.1(v) and 2.2(ii). Also, observe that $(B_{1n}\tau_n)^{\omega/2}\ge n^{-\omega/4}$ by Assumption 2.2(vi) since $B_{1n}\ge1$. Therefore, an application of Lemma P.2 with the envelope $F_2=2F_1$ and $\sigma=(CB_{1n}\tau_n)^{\omega/2}$ for a sufficiently large constant $C$ gives, with probability $1-o(1)$,

$$\sup_{f\in\mathcal F_2}|\mathbb{G}_n(f)|\lesssim(B_{1n}\tau_n)^{\omega/2}\sqrt{v_n\log a_n}+n^{-1/2+1/q}v_nK_n\log a_n,\tag{A.10}$$

since $\sup_{f\in\mathcal F_2}|f|\le2\sup_{f\in\mathcal F_1}|f|\le2F_1$ and $\|F_1\|_{P,q}\le K_n$ by Assumption 2.2(iv), and

$$\log\sup_QN(\epsilon\|F_2\|_{Q,2},\mathcal F_2,\|\cdot\|_{Q,2})\lesssim v_n\log(a_n/\epsilon)\quad\text{for all }0<\epsilon\le1$$

by Lemma O.1, because $\mathcal F_2\subset\mathcal F_1-\mathcal F_1$ for $\mathcal F_1$ defined in Assumption 2.2(iv). The claim of this step now follows from an application of Assumption 2.2(vi) to bound the right-hand side of (A.10).

Step 5. Here, we prove (A.9). For all $u\in\mathcal U$ and $j\in[\tilde p]$, let $\bar\theta_{uj}=\theta_{uj}-J_{uj}^{-1}\mathbb{E}_n[\psi_{uj}(W,\theta_{uj},\eta_{uj})]$. Then $\sup_{u\in\mathcal U,j\in[\tilde p]}|\bar\theta_{uj}-\theta_{uj}|=O_P(S_n/\sqrt n)$ since $S_n=E_P[\sup_{u\in\mathcal U,j\in[\tilde p]}|\sqrt n\,\mathbb{E}_n[\psi_{uj}(W,\theta_{uj},\eta_{uj})]|]$ and $J_{uj}$ is bounded away from zero in absolute value uniformly over $u\in\mathcal U$ and $j\in[\tilde p]$ by Assumption 2.1(iv). Therefore, $\bar\theta_{uj}\in\Theta_{uj}$ for all $u\in\mathcal U$ and $j\in[\tilde p]$ with probability $1-o(1)$ by Assumption 2.1(i). Hence, with the same probability, for all $u\in\mathcal U$ and $j\in[\tilde p]$,

$$\inf_{\theta\in\Theta_{uj}}\sqrt n\,|\mathbb{E}_n[\psi_{uj}(W,\theta,\hat\eta_{uj})]|\le\sqrt n\,|\mathbb{E}_n[\psi_{uj}(W,\bar\theta_{uj},\hat\eta_{uj})]|,$$

and so it suffices to show that

$$\sup_{u\in\mathcal U,j\in[\tilde p]}\sqrt n\,|\mathbb{E}_n[\psi_{uj}(W,\bar\theta_{uj},\hat\eta_{uj})]|=O_P(\delta_n).\tag{A.11}$$

To prove (A.11), for given $u\in\mathcal U$ and $j\in[\tilde p]$, substitute $\theta=\bar\theta_{uj}$ and $\eta=\hat\eta_{uj}$ into (A.5) and use Taylor's expansion in (A.6). This gives

$$\sqrt n\,|\mathbb{E}_n[\psi_{uj}(W,\bar\theta_{uj},\hat\eta_{uj})]|\le|\widetilde{II}_1(u,j)|+|\widetilde{II}_2(u,j)|+\sqrt n\,\big|\mathbb{E}_n[\psi_{uj}(W,\theta_{uj},\eta_{uj})]+J_{uj}(\bar\theta_{uj}-\theta_{uj})+D_{u,j,0}[\hat\eta_{uj}-\eta_{uj}]\big|,$$

where $\widetilde{II}_1(u,j)$ and $\widetilde{II}_2(u,j)$ are defined as $II_1(u,j)$ and $II_2(u,j)$ in Step 3 but with $\check\theta_{uj}$ replaced by $\bar\theta_{uj}$. Then, given that $\sup_{u\in\mathcal U,j\in[\tilde p]}|\bar\theta_{uj}-\theta_{uj}|\lesssim S_n\log n/\sqrt n$ with probability $1-o(1)$, the argument in Step 4 shows that

$$\sup_{u\in\mathcal U,j\in[\tilde p]}\big(|\widetilde{II}_1(u,j)|+|\widetilde{II}_2(u,j)|\big)=O_P(\delta_n).$$

In addition, $\mathbb{E}_n[\psi_{uj}(W,\theta_{uj},\eta_{uj})]+J_{uj}(\bar\theta_{uj}-\theta_{uj})=0$ by the definition of $\bar\theta_{uj}$, and $\sup_{u\in\mathcal U,j\in[\tilde p]}|D_{u,j,0}[\hat\eta_{uj}-\eta_{uj}]|=O_P(\delta_nn^{-1/2})$ by the near-orthogonality condition. Combining these bounds gives (A.11), which yields the claim of this step and completes the proof of the theorem.

APPENDIX B: REMAINING PROOFS FOR SECTION 2

See the Supplementary Material.

APPENDIX C: PROOFS FOR SECTIONS 3 AND 4

See the Supplementary Material.

Footnotes

SUPPLEMENTARY MATERIAL

Supplement to “Uniformly valid post-regularization confidence regions for many functional parameters in z-estimation framework” (DOI: 10.1214/17-AOS1671SUPP; .pdf). The supplemental material contains additional proofs omitted from the main text, a discussion of the double selection method, a set of new results for $\ell_1$-penalized M-estimators with functional data, additional simulation results, and an empirical application.

1

Note that the C(α)-statistic, or the orthogonal score statistic, had been explicitly used for testing (and also for setting up estimation) in high-dimensional sparse models in [11] and in [39], where it is referred to as the decorrelated score statistic. The discussion of Neyman’s construction here draws on [23]. Note also that our results cover other types of orthogonal score statistics as well, which allows us to cover much broader classes of models; see, for example, the discussion of conditional moment models with infinite-dimensional nuisance parameters below.

REFERENCES

  • [1] Andrews DWK (1994). Asymptotics for semiparametric econometric models via stochastic equicontinuity. Econometrica 62 43–72.
  • [2] Belloni A, Chen D, Chernozhukov V and Hansen C (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80 2369–2429.
  • [3] Belloni A and Chernozhukov V (2011). ℓ1-penalized quantile regression for high-dimensional sparse models. Ann. Statist. 39 82–130.
  • [4] Belloni A and Chernozhukov V (2013). Least squares after model selection in high-dimensional sparse models. Bernoulli 19 521–547. Available at arXiv:1001.0188.
  • [5] Belloni A, Chernozhukov V, Chetverikov D and Wei Y (2018). Supplement to “Uniformly valid post-regularization confidence regions for many functional parameters in z-estimation framework.” DOI: 10.1214/17-AOS1671SUPP.
  • [6] Belloni A, Chernozhukov V, Fernández-Val I and Hansen C (2013). Program evaluation with high-dimensional data. Available at arXiv:1311.2645.
  • [7] Belloni A, Chernozhukov V and Hansen C (2010). Lasso methods for Gaussian instrumental variables models. Available at arXiv:1012.1297.
  • [8] Belloni A, Chernozhukov V and Hansen C (2013). Inference for high-dimensional sparse econometric models. In Advances in Economics and Econometrics: 10th World Congress of Econometric Society, August 2010, Vol. III 245–295. Available at arXiv:1201.0220.
  • [9] Belloni A, Chernozhukov V and Hansen C (2014). Inference on treatment effects after selection among high-dimensional controls. Rev. Econ. Stud. 81 608–650.
  • [10] Belloni A, Chernozhukov V and Kato K (2013). Valid post-selection inference in high-dimensional approximately sparse quantile regression models. Available at arXiv:1312.7186.
  • [11] Belloni A, Chernozhukov V and Kato K (2015). Uniform post-selection inference for LAD regression models and other Z-estimators. Biometrika 102 77–94.
  • [12] Belloni A, Chernozhukov V and Wang L (2011). Square-root lasso: Pivotal recovery of sparse signals via conic programming. Biometrika 98 791–806.
  • [13] Belloni A, Chernozhukov V and Wang L (2014). Pivotal estimation via square-root Lasso in nonparametric regression. Ann. Statist. 42 757–788.
  • [14] Chamberlain G (1992). Efficiency bounds for semiparametric regression. Econometrica 60 567–596.
  • [15] Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W and Robins J (2018). Double/debiased machine learning for treatment and structural parameters. Econom. J. 21 C1–C68.
  • [16] Chernozhukov V, Chetverikov D and Kato K (2013). Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. Ann. Statist. 41 2786–2819.
  • [17] Chernozhukov V, Chetverikov D and Kato K (2014). Anti-concentration and honest, adaptive confidence bands. Ann. Statist. 42 1787–1818.
  • [18] Chernozhukov V, Chetverikov D and Kato K (2017). Central limit theorems and bootstrap in high dimensions. Ann. Probab. 45 2309–2352.
  • [19] Chernozhukov V, Chetverikov D and Kato K (2014). Gaussian approximation of suprema of empirical processes. Ann. Statist. 42 1564–1597.
  • [20] Chernozhukov V, Chetverikov D and Kato K (2015). Comparison and anti-concentration bounds for maxima of Gaussian random vectors. Probab. Theory Related Fields 162 47–70.
  • [21] Chernozhukov V, Chetverikov D and Kato K (2015). Empirical and multiplier bootstraps for suprema of empirical processes of increasing complexity, and related Gaussian couplings. Available at arXiv:1502.00352.
  • [22] Chernozhukov V, Fernández-Val I and Melly B (2013). Inference on counterfactual distributions. Econometrica 81 2205–2268.
  • [23] Chernozhukov V, Hansen C and Spindler M (2015). Post-selection and post-regularization inference in linear models with very many controls and instruments. Am. Econ. Rev. Pap. Proc. 105 486–490.
  • [24] Deng H and Zhang C-H (2017). Beyond Gaussian approximation: Bootstrap for maxima of sums of independent random vectors. Available at arXiv:1705.09528.
  • [25] Dudley R (1999). Uniform Central Limit Theorems. Cambridge Studies in Advanced Mathematics 63. Cambridge Univ. Press, Cambridge.
  • [26] Hothorn T, Kneib T and Bühlmann P (2014). Conditional transformation models. J. Roy. Statist. Soc. Ser. B 76 3–27.
  • [27] Javanmard A and Montanari A (2014). Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res. 15 2869–2909.
  • [28] Javanmard A and Montanari A (2014). Hypothesis testing in high-dimensional regression under the Gaussian random design model: Asymptotic theory. IEEE Trans. Inform. Theory 60 6522–6554.
  • [29] Kosorok M (2008). Introduction to Empirical Processes and Semiparametric Inference. Springer, Berlin.
  • [30] Leeb H and Pötscher B (2008). Can one estimate the unconditional distribution of post-model-selection estimators? Econometric Theory 24 338–376.
  • [31] Leeb H and Pötscher B (2008). Recent developments in model selection and related areas. Econometric Theory 24 319–322.
  • [32] Leeb H and Pötscher BM (2008). Sparse estimators and the oracle property, or the return of Hodges' estimator. J. Econometrics 142 201–211.
  • [33] Linton O (1996). Edgeworth approximation for MINPIN estimators in semiparametric regression models. Econometric Theory 12 30–60. Cowles Foundation Discussion Papers 1086 (1994).
  • [34] Mammen E (1993). Bootstrap and wild bootstrap for high dimensional linear models. Ann. Statist. 21 255–285.
  • [35] Newey W (1990). Semiparametric efficiency bounds. J. Appl. Econometrics 5 99–135.
  • [36] Newey W (1994). The asymptotic variance of semiparametric estimators. Econometrica 62 1349–1382.
  • [37] Neyman J (1959). Optimal asymptotic tests of composite statistical hypotheses. In Probability and Statistics: The Harald Cramér Volume (Grenander U, ed.) 213–234. Almqvist & Wiksell, Stockholm.
  • [38] Neyman J (1979). c(α) tests and their use. Sankhyā 41 1–21.
  • [39] Ning Y and Liu H (2014). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. Available at arXiv:1412.8765.
  • [40] Pötscher B and Leeb H (2009). On the distribution of penalized maximum likelihood estimators: The LASSO, SCAD, and thresholding. J. Multivariate Anal. 100 2065–2082.
  • [41] Robins J and Rotnitzky A (1995). Semiparametric efficiency in multivariate regression models with missing data. J. Amer. Statist. Assoc. 90 122–129.
  • [42] Stein C (1956). Efficient nonparametric testing and estimation. In Proc. 3rd Berkeley Symp. Math. Statist. and Probab. 1 187–195. Univ. California Press, Berkeley, CA.
  • [43] Van de Geer S, Bühlmann P, Ritov Y and Dezeure R (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. 42 1166–1202.
  • [44] Van der Vaart A (1998). Asymptotic Statistics. Cambridge Univ. Press, Cambridge.
  • [45] Van der Vaart A and Wellner J (1996). Weak Convergence and Empirical Processes. Springer, New York.
  • [46] Zhang C-H and Zhang S (2014). Confidence intervals for low-dimensional parameters with high-dimensional data. J. Roy. Statist. Soc. Ser. B 76 217–242.
  • [47] Zhao T, Kolar M and Liu H (2014). A general framework for robust testing and confidence regions in high-dimensional quantile regression. Available at arXiv:1412.8724.
