Abstract
In this paper, we develop procedures to construct simultaneous confidence bands for potentially infinite-dimensional parameters after model selection for general moment condition models where is potentially much larger than the sample size of available data, n. This allows us to cover settings with functional response data where each of the parameters is a function. The procedure is based on the construction of score functions that satisfy Neyman orthogonality condition approximately. The proposed simultaneous confidence bands rely on uniform central limit theorems for high-dimensional vectors (and not on Donsker arguments as we allow for ). To construct the bands, we employ a multiplier bootstrap procedure which is computationally efficient as it only involves resampling the estimated score functions (and does not require resolving the high-dimensional optimization problems). We formally apply the general theory to inference on regression coefficient process in the distribution regression model with a logistic link, where two implementations are analyzed in detail. Simulations and an application to real data are provided to help illustrate the applicability of the results.
Keywords: Inference after model selection, moment condition models with a continuum of target parameters, Lasso and Post-Lasso with functional response data
1. Introduction.
High-dimensional models have become increasingly popular in the last two decades. Much research has been conducted on estimation of these models. However, inference about parameters in these models is much less understood, although the literature on inference is growing quickly; see the list of references below. In this paper, we construct simultaneous confidence bands for many functional parameters parameters in a very general framework of moment condition models, where each parameter itself can be an infinite-dimensional object, and the number of such parameters can be much larger than the sample size of available data. Our paper builds upon [11], where simultaneous confidence bands have been constructed for many scalar parameters in a high-dimensional sparse z-estimation framework.
As a substantive application, we apply our general results to provide simultaneous confidence bands for parameters in a logistic regression model with functional response data
(1.1) |
where is a -vector of covariates whose effects are of interest, X = (X1,…,Xp)′ is a p-vector of controls, is the logistic link function, is a set of indices and for each , for some constants and the response variable Y, is a vector of target parameters and βu = (βu1,…,βup)′ is a vector of nuisance parameters. Here, both and p are allowed to be potentially much larger than the sample size n, and we have functional target parameters and p functional nuisance parameters . This example is important because it demonstrates that our methods can be used for inference about the whole distribution of the response variable Y given D and X in a high-dimensional setting, and not only about some particular features of it such as mean or median. This model is called a distribution regression model in [22] and a conditional transformation model in [26], who argue that the model provides a rich class of models for conditional distributions, and offers a useful generalization of traditional proportional hazard models as well as a useful alternative to quantile regression. We develop inference methods to construct simultaneous confidence bands for many functional parameters of this model in Section 3.
Toward this goal, our contributions include to effectively estimate a continuum of high-dimensional nuisance parameters, allow for approximately sparse models, control sparse eigenvalues of a continuum of random matrices, establish an approximate linearization for a collection of “orthogonalized” (or “de-biased”) estimators and establish the validity of a multiplier bootstrap for the construction of confidence bands for the many functional parameters of interest based on these estimators. In particular, these contributions build upon but go much beyond [11] (Corollary 4), which considers the special case of many scalar parameters in a z-estimation framework, and beyond [19] (Theorem 5.1), where simultaneous confidence bands are constructed via multiplier bootstrap for any large collection of approximately linear scalar estimator , j = 1,…,p, where the ℓ∞-norm of rn is .
Our general results refer to the problem of estimating the set of parameters in the moment condition model,
(1.2) |
where W is a random element that takes values in a measurable space according to a probability measure P, and are sets of indices, and for each and , ψuj is a known score function, θuj is a scalar parameter of interest and ηuj is a potentially high-dimensional (or infinite-dimensional) nuisance parameter. Assuming that a random sample of size n, , from the distribution of W is available together with suitable estimators of ηuj, we aim to construct simultaneous confidence bands for that are valid uniformly over a large class of probability measures P, say . Specifically, for each and , we construct an appropriate estimator of θuj along with an estimator of the standard deviation of , such that
(1.3) |
uniformly over , where α ∈ (0, 1) and cα is an appropriate critical value, which we choose to construct using a multiplier bootstrap method. The left- and the right-hand sides of the inequalities inside the probability statement (1.3) then can be used as bounds in simultaneous confidence bands for θuj’s. In this paper, we are particularly interested in the case when is potentially much larger than n and is an uncountable subset of , so that for each , is an infinite-dimensional (i.e., functional) parameter.
In the presence of high-dimensional nuisance parameters, construction of valid confidence bands is delicate. Dealing with high-dimensional parameters requires relying upon regularization that leads to lack of asymptotic linearization of the estimators of target parameters since regularized estimators of nuisance parameters suffer from a substantial bias and this bias spreads into the estimators of the target parameters. This lack of asymptotic linearization in turn typically translates into severe distortions in coverage probability of the confidence bands constructed by traditional techniques that are based on perfect model selection; see [30–32, 40]. To deal with this problem, we assume that the score functions ψuj are constructed to satisfy a near-orthogonality condition that makes them immune to first-order changes in the value of the nuisance parameter, namely
(1.4) |
for all in an appropriate set where ∂r denotes the derivative with respect to r. We shall often refer to this condition as Neyman orthogonality, since in lowdimensional parametric settings the orthogonality property originates in the work of Neyman on the C(α) test in the 50s. In Section 2 below, we describe a few general methods for constructing the score functions ψuj obeying the Neyman orthogonality condition.
The Neyman orthogonality condition (1.4) is important because it helps to make sure that the bias from the estimators of the high-dimensional nuisance parameters does not spread into the estimators of the target parameters. In particular, under (1.4), it follows that
at least up to the first order, where the index W in EP,W [·] means that the expectation is taken over W only. This makes the estimators of the target parameters θuj immune to the bias in the estimators , which in turn improves their statistical properties and opens up the possibilities for valid inference.
As the framework (1.2) covers a broad variety of applications, it is instructive to revisit the logistic regression model with functional response data (1.1). To construct score functions ψuj that satisfy both the moment conditions (1.2) and the Neyman orthogonality condition (1.4) in this example, for and , define a -vector of additional nuisance parameters
(1.5) |
where , and
(1.6) |
Then, denoting W = (Y,D,X) and splitting θu into θuj and , we set
where and . It is straightforward to see that these score functions ψuj satisfy the moment conditions (1.2) and to see that they also satisfy the Neyman orthogonality condition (1.4), observe that
where the first line by definition of and since VarP (Yu | D, X) = Λ′ (D′ θu + X′ βu), and the second by (1.1). Because of this orthogonality condition, we can exploit the moment conditions (1.2) to construct regular, -consistent, estimators of θuj even if nonregular, regularized or post-regularized, estimators of are used to cope with high-dimensionality. Using these regular estimators of θuj, we then can construct valid confidence bands (1.3).
Our general approach to construct simultaneous confidence bands, which is developed in Section 2, can be described as follows. First, we construct the moment conditions (1.2) that satisfy the Neyman orthogonality condition (1.4), and use these moment conditions to construct estimators of θuj for all and . Second, under appropriate regularity conditions, we establish a Bahadur representation for ’s. Third, employing the Bahadur representation, we are able to derive a suitable Gaussian approximation for the distribution of ’s. Importantly, the Gaussian approximation is possible even if both and the dimension of the index set , du, are allowed to grow with n, and asymptotically remains much larger than n. Finally, from the Gaussian approximation, we construct simultaneous confidence bands using a multiplier bootstrap method. Here, the Gaussian and bootstrap approximations are constructed by applying the results on highdimensional central limit and bootstrap theorems established in [16–21] by verifying the conditions there.
Although regularity conditions underlying our approach can be verified for many models defined by moment conditions, for illustration purposes, we explicitly verify these conditions for the logistic regression model with functional response data (1.1) in Section 3. We also note that the regularity conditions, in particular those related to the entropy of the nuisance parameter estimators, can be substantially relaxed if we use sample splitting, so that the nuisance parameters and parameters of interest are estimated on separate samples; see [15]. In addition, we examine the performance of the proposed procedures in a Monte Carlo simulation study and provide an example based on real data in Section 5. Moreover, in the Supplementary Material [5], we discuss the construction of simultaneous confidence bands based on a double-selection estimator. This estimator does not require to explicitly construct the score functions satisfying the Neyman orthogonality condition but nonetheless is first-order equivalent to the estimator based on such functions.
We also develop new results for ℓ1-penalized M-estimators in Section 4 to handle functional data and criterion functions that depend on nuisance functions for which only estimates are available building on ideas in [3,4,12] (for brevity of the paper, generic results are deferred to Supplementary Material, and Section 4 only contains results that are relevant for the logistic regression model studied in Section 3). Specifically, we develop a method to select penalty parameters for these estimators and extend the existing theory to cover functional data to achieve rates of convergence and sparsity guarantees that hold uniformly over . The ability to allow both for functional data and for nuisance functions is crucial in the implementation and in theoretical analysis of the methods proposed in this paper.
Orthogonality conditions like that in (1.4) have played an important role in statistics and econometrics. In low-dimensional settings, a similar condition was used by Neyman in [37] and [38] while in semiparametric models the orthogonality conditions were used in [1, 35, 36, 41] and [33]. In high-dimensional settings, [7] and [2] were the first to use the orthogonality condition (1.4) in a linear instrumental variables model with many instruments. Related ideas have also been used in the literature to construct confidence bands in high-dimensional linear models, generalized linear models and other nonlinear models; see [6, 8–11, 13, 27,28,43, 46, 47] and [39], where we can interpret each procedure as implicitly or explicitly constructing and solving an approximately Neyman-orthogonal estimating equation. We contribute to this quickly growing literature by providing procedures to construct simultaneous confidence bands for many infinite-dimensional parameters identified by moment conditions.
Throughout the paper, we use the standard notation from the empirical process theory. In particular, we use to denote the expectation with respect to the empirical measure associated with the data , and we use to denote the empirical process . More details about the notation are given in the Supplementary Material.
2. Confidence regions for function-valued parameters based on moment conditions.
2.1. Generic construction of confidence regions.
In this section, we state our results under high-level conditions. In the next section, we will apply these results to construct simultaneous confidence bands for many infinite-dimensional parameters in the logistic regression model with functional response data.
Recall that we are interested in constructing simultaneous confidence bands for a set of target parameters where for each and , the parameter θuj satisfies the moment condition (1.2) with ηuj being a potentially high-dimensional (or infinite-dimensional) nuisance parameter. Assume that θuj ∈ Θuj, a finite or infinite interval in , and that ηuj ∈ Tuj, a convex set in a normed space equipped with a norm ∥ · ∥e. We allow to be a possibly uncountable set of indices, and to be potentially large.
We assume that a random sample from the distribution of W is available for constructing the confidence bands. We also assume that for each and , the nuisance parameter ηuj can be estimated by using the same data . In the next section, we discuss examples where ’s are based on Lasso or Post-Lasso methods (although other modern regularization and postregularization methods can be applied). Our confidence bands will be based on the estimators of θuj that are for each and defined as approximate ϵn-solutions in Θuj to sample analogs of the moment conditions (1.2), that is,
(2.1) |
where ϵn = o(δnn−1/2) for all n ≥ 1 and some sequence (δn)n≥1 of positive constants converging to zero.
To motivate the construction of the confidence bands based on the estimators , we first study distributional properties of these estimators. To do that, we will employ the following regularity conditions. Let C0 be a strictly positive (and finite) constant, and for each and , let be some subset of Tuj, whose properties are specified below in assumptions. In particular, we will choose the sets so that, on the one hand, their complexity does not grow too fast with n but, on the other hand, for each and , the estimator takes values in with high probability. As discussed before, we rely on the following nearorthogonality condition.
DEFINITION 2.1 (Near-orthogonality condition). For each and , we say that ψuj obeys the near-orthogonality condition with respect to if the following conditions hold: The Gateaux derivative map
exists for all and and (nearly) vanishes at , namely,
(2.2) |
for all .
At the end of this section, we describe several methods to obtain score functions ψuj that obey the near-orthogonality condition. Together these methods cover a wide variety of applications.
Let ω and c0 be some strictly positive (and finite) constants, and let n0 ≥ 3 be some positive integer. Also, let (B1n)n≥1 and (B2n)n≥1 be some sequences of positive constants, possibly growing to infinity, where B1n ≥ 1 for all n ≥ 1. In addition, denote
(2.3) |
The quantity measures how rich the process is. The quantity Juj measures the degree of identifiability of θuj by the moment condition (1.2). In many applications, it is bounded in absolute value from above and away from zero. Finally, let be a set of probability measures P of possible distributions of W on the measurable space .
We collect our main conditions on the score functions ψuj and the true values of the target parameters θuj in the following assumption.
Assumption 2.1 (Moment condition problem). For all n ≥ n0, , , and , the following conditions hold: (i) The true parameter value θuj obeys (1.2), and Θuj contains a ball of radius centered at θuj. (ii) The map (θ,η) ↦ EP[ψuj(W,θ,η)] is twice continuously Gateaux-differentiable on . (iii) The score function ψuj obeys the nearorthogonality condition given in Definition 2.1 for the set . (iv) For all θ ∈ Θuj, |EP[ψuj(W,θ,ηuj)]| ≥ 2−1|Juj(θ - θuj)| ∧ c0, where Juj satisfies c0 ≤ |Juj | ≤ C0. (v) For all r ∈ [0, 1), θ ∈ Θuj, and η ∈ Tuj:
,
,
.
Assumption 2.1 is mild and standard in moment condition problems. Assumption 2.1 (i) requires θuj to be sufficiently separated from the boundary of Θuj. Assumption 2.1(ii) requires that the functions (θ,η) ↦ EP[ψuj(W,θ,η)] are smooth. It is a mild condition because it does not require smoothness of the functions (θ, η) ↦ ψuj(W,θ,η). Assumption 2.1(iii) is our key condition and is discussed above. Assumption 2.1(iv) implies sufficient identifiability of θuj. In particular, it implies that the equation EP[ψuj(W,θ,ηuj)] = 0 has only one solution θ = θuj. If this equation has multiple solutions, Assumption 2.1(iv) implies that the set Θuj is restricted enough so that there is only one solution in Θuj. Assumption 2.1(v-a) means that the functions (θ, η) ↦ ψuj(W,θ,η) mapping Θuj × Tuj into L2(P) are Lipschitz-continuous at (θ, η) = (θuj,ηuj) with Lipschitz order ω/2. In most applications, we can set ω = 2. Assumptions 2.1(v-b,v-c) impose smoothness bounds on the functions (θ, η) ↦ EP[ψuj(W,θ,η)].
Next, we state our conditions related to the estimators . Let (Δn)n≥1 and (τn)n≥1 be some sequences of positive constants converging to zero. Also, let (an)n≥1, (vn)n≥1, and (Kn)n≥1 be some sequences of positive constants, possibly growing to infinity, where an ≥ n ∨ Kn and vn ≥ 1 for all n ≥ 1. Finally, let q ≥ 2 be some constant.
Assumption 2.2 (Estimation of nuisance parameters). For all n ≥ n0 and , the following conditions hold: (i) With probability at least 1 − Δn, we have for all and . (ii) For all , , and , ∥η − ηuj ∥e ≤ τn. (iii) For all and , we have . (iv) The function class is suitably measurable and its uniform entropy numbers obey
(2.4) |
where F1 is a measurable envelope for that satisfies ∥F1∥P,q ≤ Kn. (v) For all , we have c0 ≤ ∥ f ∥P,2 ≤ C0. (vi) The complexity characteristics an and vn satisfy:
,
,
.
Assumption 2.2 provides sufficient conditions for the estimation of the nuisance parameters . Assumption 2.2 (i) requires that the set is large enough so that with high probability. Assumptions 2.2 (i,ii) together require that the estimator converges to ηuj with the rate τn. This rate should be fast enough so that Assumptions 2.2(vi-b,vi-c) are satisfied. Assumption 2.2(iv) gives a bound on the complexity of the set expressed via uniform entropy numbers, and Assumptions 2.2(vi-a,vi-b) require that the set is small enough so that its complexity does not grow too fast. Assumption 2.2(v) requires that the functions (θ, η) ↦ ψuj(W,θ,η) are scaled properly. Suitable measurability of , required in Assumption 2.2(iv), is a mild condition that is satisfied in most practical cases; see the Supplementary Material and [25] for clarifications. Overall, Assumption 2.2 shows the trade-off in the choice of the sets : setting large, on the one hand, makes it easy to satisfy Assumption 2.2(i) but, on the other hand, yields large values of an and vn in Assumption 2.2(iv) making it difficult to satisfy Assumption 2.2(vi).
We stress that the class does not need to be Donsker because its uniform entropy numbers are allowed to increase with n. This is important because allowing for non-Donsker classes is necessary to deal with high-dimensional nuisance parameters. Note also that our conditions are very different from the conditions imposed in various settings with nonparametrically estimated nuisance functions; see, for example, [44, 45] and [29].
In addition, we emphasize that the conditions stated in Assumption 2.2 are sufficient for our results for the general model (1.2) but can often be relaxed if the structure of the functions ψuj(W,θ,η) is known. For example, it is possible to relax Assumption 2.2(vi) if the functions ψuj(W,θ,η) are linear in θ, which happens in the linear regression model with θ being the coefficient on the covariate of interest; see [9]. Moreover, it is possible to relax the entropy condition (2.4) of Assumption 2.2 by relying upon sample splitting, where part of the data is used to estimate ηuj, and the other part is used to estimate θuj given an estimate of ηuj; see [2] and [15]. By swapping the role of two parts, and averaging the resulting two estimators, we do not incur any efficiency losses.
The following theorem is our first main result in this paper.
Theorem 2.1 (Uniform Bahadur representation). Under Assumptions 2.1 and 2.2, for an estimator that obeys (2.1), we have
(2.5) |
in uniformly over , where and .
Comment 2.1 (On the proof ofTheorem 2.1). To prove this theorem, we use the following identity:
(2.6) |
(2.7) |
Here, the term on the right-hand side of (2.6) is the main term on the right-hand side of (2.5), up to a normalization (σuj Juj)−l. Also, we show that the first term in (2.7) is OP (δn) since satisfies (2.1). Moreover, using arather standard theory of Z-estimators, we show that . This in turn allows us to show with the help of empirical process arguments that the difference of the last two terms in (2.7) is OP (δn) as well. (In [15], we also point out that this difference is OP (δn) under much weaker entropy conditions than those in Assumption 2.2 if and are obtained using separate samples.) Thus, it remains to show that the left-hand side of (2.6) is equal to the left-hand side of (2.5) up to an approximation error OP (δn) and up to a normalization (σui juj)−1. To do so, we use second-order Taylor’s expansion of the function
at r = 1 around r = 0. This gives
for some . Here, follows from Assumptions 2.1 and 2.2 and the key near-orthogonality condition also allows us to show that . Without this condition, the term would give first-order bias and lead to slower-than- rate of convergence of the estimator . Finally, again using the empirical process arguments, we can show that all the bounds including the term OP (δn) hold uniformly over and .
Comment 2.2 (On uniformity in u in Theorem 2.1). When the functions are Lipschitz-continuous, one can use a simple discretization argument to conclude that the approximation in (2.5) holds uniformly over as long as we can show that it holds for each . However, in many applications, including the distribution regression model discussed in Section 3, this function is actually not continuous, and the location of jumps depends on the data. Therefore, we have to rely on a more complicated argument to establish uniformity in u in the bound (2.5).
The uniform Bahadur representation derived in Theorem 2.1 is useful for the construction of simultaneous confidence bands for as in (1.3). For this purpose, we apply new high-dimensional central limit and bootstrap theorems that have been recently developed in a sequence of papers [16, 18–20] and [21]. To apply these theorems, we make use of the following regularity condition.
Let be a sequence of positive constants converging to zero. Also, let (ϱn)n≥1, , (An)n≥1, , and (Ln)n≥1 be some sequences of positive constants, possibly growing to infinity, where ϱn ≥ 1, An ≥ n, and for all n ≥ 1. In addition, from now on, we assume that q > 4. Denote by an estimator of , with and being suitable estimators of Juj and σuj.
Assumption 2.3 (Additional score regularity). For all n ≥ n0 and , the following conditions hold: (i) The function class is suitably measurable and its uniform entropy numbers obey
where F0 is a measurable envelope for that satisfies ∥F0∥P,q ≤ Ln. (ii) For all and k = 3,4, we have . (iii) The function class satisfies with probability for all 0 < ϵ ≤ 1 and for all .
This assumption is technical, and its verification in applications is rather standard. For the Gaussian approximation result below, we actually only need the first and the second part of this assumption. The third part will be needed for establishing validity of the simultaneous confidence bands based on the multiplier bootstrap procedure. As a side note, observe that Assumption 2.3 allows to bound defined in (2.3) and used in Assumptions 2.1 and 2.2; see Appendix G of the Supplementary Material.
Next, let denote a tight zero-mean Gaussian process indexed by with covariance operator given by for and . We have the following corollary of Theorem 2.1, which is our second main result in this paper.
Corollary 2.1 (Gaussian approximation). Suppose that Assumptions 2.1, 2.2 and 2.3(i,ii) hold. In addition, suppose that thefollowing growth conditions are satisfied: , and . Then
uniformly over .
Based on Corollary 2.1, we are now able to construct simultaneous confidence bands for θuj’s as in (1.3). In particular, we will use the Gaussian multiplier bootstrap method employing the estimates of . To describe the method, define the process
(2.8) |
where are independent standard normal random variables which are independent from the data . Then the multiplier bootstrap critical value cα is defined as the (1 − α) quantile of the conditional distribution of given the data . To prove validity of this critical value for the construction of simultaneous confidence bands of the form (1.3), we will impose the following additional assumption. Let (εn)n≥1 be a sequence of positive constants converging to zero.
Assumption 2.4 (Variation estimation). For all n ≥ n0 and ,
The following corollary establishing validity of the multiplier bootstrap critical value ca for the simultaneous confidence bands construction is our third main result in this paper.
Corollary 2.2 (Simultaneous confidence bands). Suppose that Assumptions 2.1–2.4 hold. In addition, suppose that the growth conditions of Corollary 2.1 hold. Finally, suppose that εn ϱn log An = o(1), and . Then (1.3) holds uniformly over .
Comment 2.3 (Confidence bands based on other bootstrap schemes). Results in [24] suggest that the conditions of Corollary 2.2 can be somewhat relaxed if, instead of using the Gaussian weights in the multiplier bootstrap method, we use Mammen’s weights as in [34] or if we use the empirical bootstrap instead of the multiplier bootstrap. Since the results in [24] apply only to high-dimensional random vectors and do not apply to infinite-dimensional random processes, we leave a formal discussion of the results under these alternative bootstrap schemes to future work.
2.2. Construction of score functions satisfying the orthogonality condition.
Here, we discuss several methods for generating orthogonal scores in a wide variety of settings, including the classical Neyman’s construction. In what follows, since the argument applies to each u and j, it is convenient to omit the indices u and j and also to use the subscript 0 to indicate the true values of the parameters. For simplicity, we also focus the discussion on the exactly orthogonal case. With these simplifications, we can restate the orthogonality condition as follows: we say that the score ψ obeys the Neyman orthogonality condition with respect to if the following conditions hold: The Gateaux derivative map
exists for all and and vanishes at , namely,
(2.9) |
for all .
(1) Orthogonal scores for likelihood problems with finite-dimensional nuisance parameters. In likelihood settings with finite-dimensional parameters, the construction of orthogonal equations was proposed by Neyman [37] who used them in construction of his celebrated C(α)-statistic.1
To describe the construction, suppose that the log-likelihood function associated to observation W is (θ, β) ↦ ℓ(W, θ, β), where is the target parameter and is the nuisance parameter. Under regularity conditions, the true parameter values θ0 and β0 obey
(2.10) |
Now consider the new score function
(2.11) |
where the nuisance parameter is
μ is the d × p0 orthogonalization parameter matrix whose true value μ0 solves the equation
And
Provided that μ0 is well defined, we have by (2.10) that EP[ψ(W, θ0,η0)] = 0, where . Moreover, it is trivial to verify that under standard regularity conditions the score function ψ obeys the near orthogonality condition (2.2) exactly (i.e., with C0 = 0), that is,
Note that in this example, μ0 not only creates the necessary orthogonality but also creates the efficient score for inference on the main parameter θ, as emphasized by Neyman.
(2) Orthogonal scores for likelihood problems with infinite-dimensional nuisance parameters. The Neyman’s construction can be extended to semi-parametric models, where the nuisance parameter β is a function. In this case, the original score functions (θ, β) ↦ ∂θℓ(W, θ, β) corresponding to the log-likelihood function (θ, β) ↦ ℓ(W, θ, β) associated to observation W can be transformed into efficient score functions ψ that obey the exact orthogonality condition (2.9) by projecting the original score functions onto the orthocomplement of the tangent space induced by the nuisance parameter β; see Chapter 25 of [44] for a detailed description of this construction. Note that the projection may create additional nuisance parameters, so that the new nuisance parameter η could be of larger dimension than β. Other relevant references include [9, 29, 45] and [11]. The approach is related to Neyman’s construction in the sense that the score ψ arising in this model is actually the Neyman’s score arising in a one-dimensional least favorable parametric subfamily, [42]; see Chapter 25 of [44] for details.
(3) Orthogonal scores for conditional moment problems with infinite-dimensional nuisance parameters. Next, consider a conditional moment restrictions framework studied by Chamberlain [14]. To define the framework, let W, D and V be random vectors in , and , respectively, with D and V being subvectors of W, so that dD + dV ≤ dW. Also, let be a finite-dimensional parameter whose true value θ0 is of interest, and let be a vectorvalued functional nuisance parameter, with the true value being . The conditional moment restrictions framework assumes that θ0 and h0 satisfy the following equation:
(2.12) |
where is some known function. This framework is of interest because it covers an extremely rich variety of models, without having to explicitly rely on the likelihood formulation. For example, it covers the partial linear model
(2.13) |
where Y is a scalar dependent random variable, D is a scalar independent treatment random variable, V is a vector of control random variables and U is a scalar unobservable noise random variable. Indeed, (2.13) implies (2.12) by setting W = (Y, D,V′)′ and m(W, θ, h) = Y − Dθ − h.
Here, we would like to build a (generalized) score function (θ, η) ↦ ψ(W, θ, η) for estimating θ0, the true value of parameter θ, where η is a new nuisance parameter with true value η0, that obeys the near orthogonality condition (2.2). There are many ways to do so but one particularly useful way is the following. Consider the functional parameters and whose true values are given by
where
Then set η = (h, φ, Σ) and η0 = (h0,φ0,Σ0), and define the score function:
It is rather straightforward to verify that under mild regularity conditions, the score function ψ satisfies the moment condition, EP[ψ(W, θ0,η0)] = 0, and in addition, the orthogonality condition:
Note that this construction gives the efficient score function ψ that yields an estimator of θ0 achieving the semiparametric efficiency bound, as calculated by Chamberlain [14].
3. Application to logistic regression model with functional response data.
In this section, we apply our main results to a logistic regression model with functional response data described in the Introduction.
3.1. Model.
We consider a response variable that induces a functional response by for a set of indices and some constants . We are interested in the dependence of this functional response on a -vector of covariates, , controlling for a p-vector of additional covariates . We allow both and p to be (much) larger than the sample size of available data, n.
For each , we assume that Yu satisfies the generalized linear model with the logistic link function
(3.1) |
where is a vector of parameters that are of interest, βu = (βu1,…,βup)′ is a vector of nuisance parameters, ru = ru(D, X) is an approximation error, is the logistic link function defined by
and is the distribution of the triple W = (Y,D,X). As in the previous section, we construct simultaneous confidence bands for the parameters based on a random sample from the distribution of W = (Y, D,X).
3.2. Orthogonal score functions.
Before setting up score functions that satisfy both the moment conditions (1.2) and the orthogonality condition (1.4), observe that “naive” score functions that follow directly from the model (3.1),
where , satisfy the moment conditions EP[muj(W,θuj)] = 0 but violate the orthogonality condition (1.4) [with muj replacing ψuj and ]. To satisfy the orthogonality condition (1.4), we proceed using an approach from Section 2.2 as in the Introduction. Specifically, for each and , define the -vector of additional nuisance parameters by (1.5) where is defined in (1.6). Thus, by the first-order condition of (1.5), the nuisance parameters satisfy
(3.2) |
Also, denote . Then we set
where the nuisance parameter is . As we formally demonstrate in the proof of Theorem 3.1 below, this function satisfies the near-orthogonality condition (1.4).
3.3. Estimation using orthogonal score functions.
Next, we discuss estimation of ηuj’s and θuj’s. First, we assume that the approximation error ru = ru (D, X) is asymptotically negligible, so that it can be estimated by , the identically zero function of D and X. Second, for , we consider an estimator defined as a post-regularization weighted least squares estimator corresponding to the problem (1.5). Third, for , we consider a plug-in estimator , where and are suitable estimators of θu and βu. In particular, we assume that and are post-regularization maximum likelihood estimators corresponding to the log-likelihood function (θ, β) ↦ −Mu (W, θ, β) where
(3.3) |
The details of the estimators , and are given in Algorithm 1 below. The results in this paper can also be easily extended to the case where , and are replaced by penalized maximum likelihood estimators and and penalized weighted least squares estimator , respectively.
Then our estimator of ηuj is . Substituting this estimator into the score function ψuj gives
(3.4) |
which, using the sample analog (2.1) of the moment conditions (1.2), gives the following estimator of θuj:
(3.5) |
The algorithm is summarized as follows.
ALGORITHM 1. For each and :
3.4. Regularity conditions.
Next, we specify our regularity conditions. For all and , denote . Also, denote . Let q, c1 and C1 be some strictly positive (and finite) constants where q > 4. Moreover, let (δn)n≥1 and (Δn)n≥1 be some sequences of positive constants converging to zero. Finally, let (Mn,1)n≥1 and (Mn,2)n≥1 be some sequences of positive constants, possibly growing to infinity, where Mn,1 ≥ 1 and Mn,2 ≥ 1 for all n.
Assumption 3.1 (Parameters). For all , we have and . In addition, for all , we have . Finally, for all and , Θuj contains a ball of radius (log log n) (log an) 3/2/n1/2 centered at θuj.
Assumption 3.2 (Sparsity). There exist s = sn and , and , such that for all , and .
Assumption 3.3 (Distribution of Y). The conditional pdf of Y given (D, X) is bounded by C1.
Assumptions 3.1–3.3 are mild and standard in the literature. In particular, Assumption 3.1 requires the parameter spaces Θuj to be bounded, and also requires that for each and , the parameter θuj to be sufficiently separated from the boundaries of the parameter space Θuj. Assumption 3.2 requires approximate sparsity of the model (3.1). Note that in Assumption 3.2, given that ’s exist, we can and will assume without loss of generality that for some with |T| ≤ sn, where is allowed to depend on u and j. Here, the -vector is defined from by keeping all components of that are in T and setting all other components to be zero. Assumption 3.3 can be relaxed at the expense of more technicalities.
Assumption 3.4 (Covariates). For all , the following inequalities hold: (i) , (ii) , and (iii) . In addition, we have that (iv) , (v) (vi) , (vii) , (viii) and (ix) .
This assumption requires that there is no multicollinearity between covariates in vectors D and X. In addition, it requires that the constants and are chosen so that the probabilities of and are both nonvanishing since otherwise we would have vanishing either for u = 0 or u = 1 violating Assumption 3.4(i). Intuitively, sending and to the left and to the right tails of the distribution of Y, respectively, would blow up the variance of the estimators , given by in Theorem 2.1, and leading eventually to the estimators with slower-than- rate of convergence. Although our results could be extended to allow for the case where and are sent to the tails of the distribution of Y slowly, we skip this extension for the sake of clarity. Moreover, Assumption 3.4 imposes constraints on various moments of covariates. Since these constraints might be difficult to grasp, at the end of this section, in Corollary 3.3, we provide an example for which these constraints simplify into easily interpretable conditions.
Assumption 3.5 (Approximation error). For all , we have (i) , (ii) , (iii) , and (iv) almost surely. In addition, with probability , (v) .
This assumption requires the approximation error ru = ru(D, X) to be sufficiently small. Under Assumption 3.4, the first condition of Assumption 3.5 holds if the approximation error is such that almost surely for some constant C.
3.5. Formal results.
Under specified assumptions, our estimators satisfy the following uniform Bahadur representation theorem.
Theorem 3. 1 (Uniform Bahadur representation for logistic model). Suppose that Assumptions 3.1–3.5 hold for all . In addition, suppose that the following growth condition holds: . Then for the estimators satisfying (3.5), we have
(3.6) |
in uniformly over , where , , and Juj is defined in (2.3).
This theorem allows us to establish a Gaussian approximation result for the supremum of the process :
Corollary 3.1 (Gaussian approximation for logistic model). Suppose that Assumptions 3.1–3.5 hold for all . In addition, suppose that the following growth conditions hold: , and . Then
uniformly over , where is a tight zero-mean Gaussian process indexed by with the covariance given by for u, and j, .
Based on this corollary, we are now able to construct simultaneous confidence bands for the parameters θuj. Observe that
and so it can be estimated by
In addition, , and so it can be estimated by
Moreover, as in Section 2, define , and let cα be the (1 – α) quantile of the conditional distribution of given the data where the process is defined in (2.8). Then we have the following.
Corollary 3.2 (Simultaneous confidence bands for logistic model). Suppose that Assumptions 3.1–3.5 hold for all . In addition, suppose that the following growth conditions hold: , , and sn log3 an = o(n). Then (1.3) holds uniformly over .
To conclude this section, we provide an example for which conditions of Corollary 3.2 are easy to interpret. Recall that .
Corollary 3.3 (Uniform confidence bands for logistic regression model under simple conditions). Suppose that Assumptions 3.1–3.3, 3.4(i,ii,iv) and 3.5(i,ii,iv,v) hold for q > 4for all . In addition, suppose that and . Moreover, suppose that the following growth conditions hold: log7 an /n = o(1), and . Then (1.3) holds uniformly over .
Comment 3.1 (Estimation of variance). When constructing the confidence bands based on (1.3), we find in simulations that it is beneficial to replace the estimators of by max where is an alternative consistent estimator of .
Comment 3.2 (Alternative implementations, double selection). We note that the theory developed here is applicable for different estimators that construct the new score function with the desired orthogonality condition implicitly. For example, the double selection idea yields an implementation of an estimator that is first-order equivalent to the estimator based on the score function. The algorithm yielding the double selection estimator is as follows.
ALGORITHM 2. For each and :
Step 1′. Run post-ℓ1-penalized logistic estimator (4.2) of Yu on D and X to compute .
Step 2′. Define the weights .
Step 3′. Run the Lasso estimator (4.4) of on to compute .
Step 4′. Run logistic regression of Yu on Dj and all the selected variables in Steps 1′ and 3′ to compute .
As mentioned by a referee, it is surprising that the double selection procedure has uniform validity. The use of the additional variables selected in Step 3′, through the first-order conditions of the optimization problem, induces the necessary nearorthogonality condition. We refer to the Supplementary Material for a more detailed discussion.
Comment 3.3 (Alternative implementations, one-step correction). Another implementation for which the theory developed here applies is to replace Step 5 in Algorithm 1 with a one-step procedure. This relates to the debiasing procedure proposed in [43] to the case when the set is a singleton. In this case, instead of minimizing the criterion (3.5) in Step 5, the method makes a full Newton step from the initial estimate,
Step 5″. Compute .
The theory developed here directly apply to those estimators as well.
Comment 3.4 (Extension to other approximately sparse generalized linear models). Inspecting the proofs of Theorem 3.1 and Corollaries 3.1–3.3 reveal that these results can be extended with minor modifications to cover other approximately sparse generalized linear models. For example, the results can be extended to cover the model (3.1) where we use the probit link function instead of the logit link function Λ.
4. ℓ1-Penalized M-estimators: Nuisance functions and functional data.
In this section, we define the estimators , and , which were used in the previous section, and study their properties. We consider the same setting as that in the previous section. The results in this section rely upon a set of new results for ℓ1-penalized M-estimators with functional data presented in Appendix M of the Supplementary Material.
4.1. ℓ1-Penalized logistic regression for functional response data: Asymptotic properties.
Here, we consider the generalized linear model with the logistic link function and functional response data (3.1). As explained in the previous section, we assume that and are post-regularization maximum likelihood estimators of θu and βu corresponding to the log-likelihood function Mu(W, θ, β) =Mu(Yu, D, X, θ, ß) defined in (3.3). To define these estimators, let and be ℓl-penalized maximum likelihood (logistic regression) estimators
(4.1) |
where λ is a penalty level and a diagonal matrix of penalty loadings. We choose parameters λ and according to Algorithm 3 described below. Using the ℓ1 - penalized estimators and , we then define post-regularization estimators and by
(4.2) |
We derive the rate of convergence and sparsity properties of and as well as of and in Theorem 4.1 below. Recall that .
Algorithm 3 (Penalty level and loadings for logistic regression). Choose γ ∈ [1/n, 1/log n] and c > 1 (in practice, we set c = 1.1 and γ = 0.1/log n). Define with Nn = n. To select , choose a constant as an upper bound on the number of loops and proceed as follows: (0) Let , m = 0 and initialize for . (1) Compute and based on . (2) Set . (3) If , report the current value of and stop; otherwise set m ← m + 1 and go to step (1).
Theorem 4.1 (Rates and sparsity for functional response under logistic link). Suppose that Assumptions 3.1–3.5 hold for all . In addition, suppose that the penalty level λ and the matrices of penalty loadings are chosen according to Algorithm 3. Moreover, suppose that the following growth condition holds: . Then there exists a constant such that uniformly over all with probability 1 − o(1),
and the estimators and are uniformly sparse: . Also, uniformly overall , with probability 1 − o(1),
4.2. Lasso with estimated weights: Asymptotic properties.
Here, we consider the weighted linear model (3.2) for and . Using the parameter appearing in Assumption 3.2, it will be convenient to rewrite this model as
(4.3) |
where is an approximation error, which is asymptotically negligible under Assumption 3.2. As explained in the previous section, we assume that is a post-regularization weighted least squares estimator of (or ). To define this estimator, let be an ℓ1-penalized (weighted Lasso) estimator
(4.4) |
where λ and are the associated penalty level and the diagonal matrix of penalty loadings specified below in Algorithm 4 and where ’s are estimated weights. As in Algorithm 1 in the previous section, we set . Using , we define a post-regularized weighted least squares estimator:
(4.5) |
We derive the rate of convergence and sparsity properties of as well as of in Theorem 4.2 below.
Algorithm 4 (Penalty level and loadings for weighted Lasso). Choose γ ∈ [1/n, 1/log n] and c > 1 (in practice, we set c = 1.1 and γ = 0.1/log n). Define with . To select , choose a constant as an upper bound on the number of loops and proceed as follows: (0) Set m = 0 and . (1) Compute and based on . (2) Set . (3) If , report the current value of and stop; otherwise set m ← m + 1 and go to step (1).
Theorem 4.2 (Rates and sparsity for Lasso with estimated weights). Suppose that Assumptions 3.1–3.5 hold for all . In addition, suppose that the penalty level λ and the matrices of penalty loadings are chosen according to Algorithm 4. Moreover, suppose that the following growth condition holds: . Then there exists a constant such that uniformly over all with probability 1 − o(1),
and the estimator is uniformly sparse, . Also, uniformly over all , with probability 1 − o(1),
5. Monte Carlo simulations.
Here, we investigate the finite sample properties of the proposed estimators and the associated confidence regions. We report only the performance of the estimator based on the double selection procedure due to space constraints and note that it is very similar to the performance of the estimator based on score functions with near-orthogonality property. We will compare the proposed procedure with the traditional estimator that refits the model selected by the corresponding ℓ1-penalized M-estimator (naive post-selection estimator).
We consider a logistic regression model with functional response data where the response Yu = 1{y ≤ u} for a compact set. We specify two different designs: (1) a location model, y = x′ β0 + ξ, where ξ is distributed as a logistic random variable, the first component of x is the intercept and the other p − 1 components are distributed as N (0, Σ) with Έk,j = |0.5||k−j|; (2) a location-shift model, y = {(x′ β0 + ξ)/x′ϑ0}3, where ξ is distributed as alogistic random variable, xj = |wj| where w is a p-vector distributed as N (0, Σ) with Σk,j = |0.5||k−j|, and ϑ0 has nonnegative components. Such specification implies that for each :
Design1: θu = u(1,0,…,0)′ − β0 and Design 2: θu = u1/3ϑ0 − β0. In our simulations, we will consider n = 500 and p = 2000. For the location model (Design 1), we will consider two different choices for β0: (i) for j =1,…,p, and (ii) for j > 1 with the intercept coefficient . [These choices ensure maxj>1 |β0j | = 2 and that y is around zero in Design 2(ii).] We set . For Design 1, we have and for Design 2 we have . The results are based on 500 replications (the bootstrap procedure is performed 5000 times for each replication).
We report the (empirical) rejection frequencies for confidence regions with 95% nominal coverage, so that 0.05 is the target rejection frequency. We report the rejection frequencies for the proposed estimator and the post-naive selection estimator. Table 1 presents the performance of the methods when applied to construct a confidence interval for a single parameter ( and is a singleton). Since the setting is not symmetric, we investigate the performance for different components. Specifically, we consider {u} × {j} for j = 1,…, 5. First, consider the location model (Design 1). The difference between the performance of the naive estimator for Design 1(i) and 1(ii) highlights its fragile performance which is highly dependent on the unknown parameters. We can see from Table 1 that in Design 1(i) the Naive method achieve (pointwise) rejection frequencies up to 0.162 when the nominal level is 0.05. In Design 1(ii), it can be as high as 0.886. We also note that it is important to look at the performance of each component and avoid averaging across components (large j components are essentially not in the model, indeed for j > 50 we obtain rejection frequencies very close to 0.05 regardless of the model selection procedure). In contrast, the proposed estimator exhibits a much more robust behavior. For Design 1(i), the rejection frequencies are between 0.040 and 0.062 while for Design 1(ii) the rejection frequencies of the proposed estimator are between 0.040 and 0.056.
Table 1.
p = 2000, n = 500 | Rejection frequencies for j ∈{1,…, 5} | |||||
---|---|---|---|---|---|---|
Design | Method | j = 1 | j = 2 | j = 3 | j = 4 | j = 5 |
1(i) | Proposed | 0.042 | 0.040 | 0.062 | 0.050 | 0.044 |
Naive | 0.100 | 0.098 | 0.108 | 0.108 | 0.162 | |
1(ii) | Proposed | 0.044 | 0.040 | 0.054 | 0.056 | 0.056 |
Naive | 0.038 | 0.030 | 0.070 | 0.886 | 0.698 | |
2(i) | Proposed | 0.046 | 0.054 | 0.044 | 0.052 | 0.054 |
Naive | 0.046 | 0.050 | 0.038 | 0.070 | 0.054 | |
2(ii) | Proposed | 0.092 | 0.074 | 0.034 | 0.088 | 0.082 |
Naive | 0.034 | 0.972 | 0.182 | 0.312 | 0.916 |
Table 2 presents the performance for simultaneous confidence bands of the form where is a point estimate, is an estimate of the pointwise standard deviation and cv is a critical value that accounts for the uniform estimation. For the point estimate, we consider the proposed estimator and the post-naive selection estimator which have estimates of standard deviation. We consider two critical values: from the multiplier bootstrap (MB) procedure and the Bonferroni (BF) correction (which we expect to be conservative). For each of the four different designs [1(i), 1(ii), 2(i) and 2(ii) described above], we consider four different choices of . Table 2 displays rejection frequencies for confidence regions with 95% nominal coverage (and again 0.05 would be the ideal performance). The simulation results confirms the differences between the performance of the methods and overall the proposed procedure is closer to the nominal value of 0.05. The proposed estimator performed within a factor of two to the nominal value in 10 out of the 16 designs considered (and 13 out 16 within a factor of three). The post-naive selection estimator performed within a factor of two only in 3 out of the 16 designs when using the multiplier bootstrap as critical value (7 out of 16 within a factor of three) and similarly with the Bonferroni correction as the critical value.
Table 2.
p = 2000, n = 500 | Uniform over | ||||
---|---|---|---|---|---|
Design | Method | [1,2.5] × {1} | {1} × [10] | [1,2.5] × [10] | {1} × [1000] |
1(i) | Proposed | 0.054 | 0.036 | 0.048 | 0.040 |
Naive (MB) | 0.126 | 0.136 | 0.172 | 0.032 | |
Naive (BF) | 0.014 | 0.124 | 0.026 | 0.032 | |
1(ii) | Propose | 0.270 | 0.036 | 0.032 | 0.142 |
Naive (MB) | 0.014 | 0.802 | 0.934 | 0.404 | |
Naive (BF) | 0.000 | 0.802 | 0.718 | 0.376 | |
Design | Method | [−0.5, 0.5] × {1} | {0.5} × [10] | [−0.5,0.5] × [10] | {0.5} × [1000] |
2(i) | Proposed | 0.364 | 0.038 | 0.052 | 0.062 |
Naive (MB) | 0.116 | 0.040 | 0.022 | 0.048 | |
Naive (BF) | 0.018 | 0.038 | 0.000 | 0.046 | |
2(ii) | Proposed | 0.140 | 0.090 | 0.408 | 0.084 |
Naive (MB) | 0.002 | 0.946 | 0.996 | 0.362 | |
Naive (BF) | 0.000 | 0.946 | 0.944 | 0.298 |
Supplementary Material
APPENDIX A: PROOFS FOR SECTION 2
In this appendix, we use C to denote a strictly positive constant that is independent of n and . The value of C may change at each appearance. Also, the notation an ≲ bn means that an ≤ Cbn for all n and some C. The notation an ≳ bn means that bn ≲ an. Moreover, the notation an = o(1) means that there exists a sequence (bn)n ≥ 1 of positive numbers such that (i) |an| ≤ bn for all n, (ii) bn is independent of for all n and (iii) bn → 0 as n → ∞. Finally, the notation an = Op(bn) means that for all ϵ > 0, there exists C such that Pp(an > Cbn) ≤ 1 - ϵ for all n. Using this notation allows us to avoid repeating “uniformly over “ many times in the proofs of Theorem 2.1 and Corollaries 2.1 and 2.2. Throughout this appendix, we assume that n ≥ n0.
Proof of Theorem 2.1.
We split the proof into five steps.
Step 1. (Preliminary rate result). We claim that with probability 1 − o(1), . To show that, note that by definition of , we have for each and ,
which implies via the triangle inequality that uniformly over and , with probability 1 − o(1),
(A.1) |
where
and the bounds on I1 and I2 are derived in Step 2 [note also that ∊n = o(τn) by construction of the estimator and Assumption 2.2(vi)]. Since by Assumption 2.1(iv), does not exceed the left-hand side of (A.1), , and by Assumption 2.2(vi), B1nτn = o(1), we conclude that
(A.2) |
with probability 1 − o(1) yielding the claim of this step.
Step 2. (Bounds on I1 and I2). We claim that with probability 1 − o(1), I1 ≲ B1nτn and I2 < τn. To show these relations, observe that with probability 1 − o(1), we have I1 ≤ 2I1a + I1b and I2 ≤ I1a, where
To bound I1b, we employ Taylor’s expansion:
by Assumptions 2.1(v) and 2.2(ii).
To bound I1a, we apply the maximal inequality of Lemma P.2 to the class defined in Assumption 2.2 to conclude that with probability 1 − o(1),
(A.3) |
Here, we used: for all 0 < ∊ ≤ 1 with ‖F1‖P,q ≤ Kn by Assumption 2.2(iv); by Assumption 2.2(v); an ≥ n ∨ Kn and vn ≥ 1 by the choice of an and vn. In turn, the right-hand side of (A.3) is bounded from above by O(τn) by Assumption 2.2(vi) since and
Combining presented bounds gives the claim of this step.
Step 3. (Linearization). Here, we prove the claim of the theorem. Fix and . By definition of , we have
(A.4) |
Also, for any θ ∊ Θuj and , we have
(A.5) |
Moreover, by Taylor’s expansion of the function r ↦ EP [ψuj(W, θuj + r(θ − θuj), ηuj + r(η − ηuj))],
(A.6) |
for some . Substituting this equality into (A.5), taking and , and using (A.4) gives
(A.7) |
where
with and . It will be shown in Step 4 that
(A.8) |
In addition, it will be shown in Step 5 that
(A.9) |
Moreover, by construction of the estimator. Therefore, the expression in (A.7) is OP(δn). Also, by the near-orthogonality condition since for all and with probability 1 — o(1) by Assumption 2.2(i). Therefore, Assumption 2.1(iv) gives
The asserted claim now follows by dividing both parts of the display above by σuj (under the supremum on the left-hand side) and noting that σuj is bounded below from zero uniformly over and by Assumptions 2.2(iii) and 2.2(v).
Step 4. [Bounds on II1(u, j) and II2(u, j)]. Here, we prove (A.8). First, with probability 1 − o(1),
where the first inequality follows from Assumptions 2.1(v) and 2.2(i), the second from Step 1 and Assumptions 2.2(ii) and 2.2(vi) and the third from Assumption 2.2(vi).
Second, we have with probability 1 − o(1) that , where
for sufficiently large constant C. To bound , we applyLemmaP.2. Observe that
where we used Assumption 2.1(v) and Assumption 2.2(ii). Also, observe that (B1n τn)ω/2 ≥ n−ω/4 by Assumption 2.2(vi) since B1n ≥ 1. Therefore, an application of Lemma P.2 with an envelope F2 = 2F1 and σ = (CB1n τn)ω/2 for sufficiently large constant C gives with probability 1 − o(1),
(A.10) |
since and ‖F1‖P,q ≤ Kn by Assumption 2.2(iv) and
by Lemma O.1 because for defined in Assumption 2.2(iv). The claim of this step now follows from an application of Assumption 2.2(vi) to bound the right-hand side of (A.10).
Step 5. Here, we prove (A.9). For all and , let . Then since and Juj is bounded in absolute value below from zero uniformly over and by Assumption 2.1(iv). Therefore, for all and with probability 1 − o(1) by Assumption 2.1(i). Hence, with the same probability, for all and ,
and so it suffices to show that
(A.11) |
To prove (A.11), for given and , substitute and into (A.5) and use Taylor’s expansion in (A.6). This gives
where and are defined as II1 (u, j) and II2(u, j) in Step 3 but with replaced by . Then, given that with probability 1 − o(1), the argument in Step 4 shows that
In addition, by the definition of and by the near-orthogonality condition. Combining these bounds gives (A.11); the claim of this step follows, which completes the proof of the theorem.
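For orientation, the linearization established by the theorem is of the familiar Bahadur type: schematically, and up to the error terms controlled in Steps 1–5,
\[
\sqrt{n}\,\sigma_{uj}^{-1}\big(\hat\theta_{uj} - \theta_{uj}\big)
\;=\; -\,\sigma_{uj}^{-1} J_{uj}^{-1}\, \frac{1}{\sqrt{n}} \sum_{i=1}^n \psi_{uj}(W_i, \theta_{uj}, \eta_{uj}) \;+\; O_P(\delta_n)
\quad \text{uniformly over } u \text{ and } j.
\]
This is the representation on which the Gaussian approximation and the multiplier bootstrap results rest; the precise statement, including the sign and normalization conventions, is the one given in the theorem.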
APPENDIX B: REMAINING PROOFS FOR SECTION 2
See the Supplementary Material.
APPENDIX C: PROOFS FOR SECTIONS 3 AND 4
See the Supplementary Material.
SUPPLEMENTARY MATERIAL
Supplement to “Uniformly valid post-regularization confidence regions for many functional parameters in z-estimation framework” (DOI: 10.1214/17-AOS1671SUPP; .pdf). The supplemental material contains additional proofs omitted from the main text, a discussion of the double selection method, a set of new results for ℓ1-penalized M-estimators with functional data, additional simulation results, and an empirical application.
Footnotes
Note that the C(α)-statistic, or the orthogonal score statistic, has been explicitly used for testing (and also for setting up estimation) in high-dimensional sparse models in [11] and in [39], where it is referred to as the decorrelated score statistic. The discussion of Neyman’s construction here draws on [23]. Note also that our results cover other types of orthogonal score statistics, which allows us to handle much broader classes of models; see, for example, the discussion of conditional moment models with infinite-dimensional nuisance parameters below.
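In its classical parametric form, Neyman’s construction can be summarized in one display; the following is a textbook-style sketch (cf. [37, 38]), not a statement from this paper. With $\ell(W; \theta, \beta)$ the log-likelihood, $\theta$ the target parameter and $\beta$ the nuisance parameter, the C(α) (orthogonal) score is
\[
\psi(W; \theta, \beta) \;=\; \partial_\theta \ell(W; \theta, \beta) \;-\; \mu'\, \partial_\beta \ell(W; \theta, \beta),
\qquad
\mu' \;=\; E\big[\partial_\theta \ell\, (\partial_\beta \ell)'\big]\, \Big(E\big[\partial_\beta \ell\, (\partial_\beta \ell)'\big]\Big)^{-1},
\]
so that $\psi$ is uncorrelated with the nuisance score and, equivalently, the derivative of $E[\psi]$ with respect to local perturbations of $\beta$ vanishes at the truth. The near-orthogonality condition used in this paper is the approximate, non-likelihood analogue of this exact orthogonality.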
REFERENCES
- [1]. Andrews DWK (1994). Asymptotics for semiparametric econometric models via stochastic equicontinuity. Econometrica 62 43–72.
- [2]. Belloni A, Chen D, Chernozhukov V and Hansen C (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80 2369–2429.
- [3]. Belloni A and Chernozhukov V (2011). ℓ1-penalized quantile regression for high-dimensional sparse models. Ann. Statist. 39 82–130.
- [4]. Belloni A and Chernozhukov V (2013). Least squares after model selection in high-dimensional sparse models. Bernoulli 19 521–547. Available at arXiv:1001.0188.
- [5]. Belloni A, Chernozhukov V, Chetverikov D and Wei Y (2018). Supplement to “Uniformly valid post-regularization confidence regions for many functional parameters in z-estimation framework.” DOI: 10.1214/17-AOS1671SUPP.
- [6]. Belloni A, Chernozhukov V, Fernández-Val I and Hansen C (2013). Program evaluation with high-dimensional data. Available at arXiv:1311.2645.
- [7]. Belloni A, Chernozhukov V and Hansen C (2010). Lasso methods for Gaussian instrumental variables models. Available at arXiv:1012.1297.
- [8]. Belloni A, Chernozhukov V and Hansen C (2013). Inference for high-dimensional sparse econometric models. In Advances in Economics and Econometrics: 10th World Congress of Econometric Society, August 2010, Vol. III 245–295. Available at arXiv:1201.0220.
- [9]. Belloni A, Chernozhukov V and Hansen C (2014). Inference on treatment effects after selection among high-dimensional controls. Rev. Econ. Stud. 81 608–650.
- [10]. Belloni A, Chernozhukov V and Kato K (2013). Valid post-selection inference in high-dimensional approximately sparse quantile regression models. Available at arXiv:1312.7186.
- [11]. Belloni A, Chernozhukov V and Kato K (2015). Uniform post-selection inference for LAD regression models and other Z-estimators. Biometrika 102 77–94.
- [12]. Belloni A, Chernozhukov V and Wang L (2011). Square-root lasso: Pivotal recovery of sparse signals via conic programming. Biometrika 98 791–806.
- [13]. Belloni A, Chernozhukov V and Wang L (2014). Pivotal estimation via square-root Lasso in nonparametric regression. Ann. Statist. 42 757–788.
- [14]. Chamberlain G (1992). Efficiency bounds for semiparametric regression. Econometrica 60 567–596.
- [15]. Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W and Robins J (2018). Double/debiased machine learning for treatment and structural parameters. Econom. J. 21 C1–C68.
- [16]. Chernozhukov V, Chetverikov D and Kato K (2013). Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. Ann. Statist. 41 2786–2819.
- [17]. Chernozhukov V, Chetverikov D and Kato K (2014). Anti-concentration and honest, adaptive confidence bands. Ann. Statist. 42 1787–1818.
- [18]. Chernozhukov V, Chetverikov D and Kato K (2017). Central limit theorems and bootstrap in high dimensions. Ann. Probab. 45 2309–2352.
- [19]. Chernozhukov V, Chetverikov D and Kato K (2014). Gaussian approximation of suprema of empirical processes. Ann. Statist. 42 1564–1597.
- [20]. Chernozhukov V, Chetverikov D and Kato K (2015). Comparison and anti-concentration bounds for maxima of Gaussian random vectors. Probab. Theory Related Fields 162 47–70.
- [21]. Chernozhukov V, Chetverikov D and Kato K (2015). Empirical and multiplier bootstraps for suprema of empirical processes of increasing complexity, and related Gaussian couplings. Available at arXiv:1502.00352.
- [22]. Chernozhukov V, Fernández-Val I and Melly B (2013). Inference on counterfactual distributions. Econometrica 81 2205–2268.
- [23]. Chernozhukov V, Hansen C and Spindler M (2015). Post-selection and post-regularization inference in linear models with very many controls and instruments. Am. Econ. Rev. Pap. Proc. 105 486–490.
- [24]. Deng H and Zhang C-H (2017). Beyond Gaussian approximation: Bootstrap for maxima of sums of independent random vectors. Available at arXiv:1705.09528.
- [25]. Dudley R (1999). Uniform Central Limit Theorems. Cambridge Studies in Advanced Mathematics 63. Cambridge Univ. Press, Cambridge.
- [26]. Hothorn T, Kneib T and Bühlmann P (2014). Conditional transformation models. J. Roy. Statist. Soc. Ser. B 76 3–27.
- [27]. Javanmard A and Montanari A (2014). Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res. 15 2869–2909.
- [28]. Javanmard A and Montanari A (2014). Hypothesis testing in high-dimensional regression under the Gaussian random design model: Asymptotic theory. IEEE Trans. Inform. Theory 60 6522–6554.
- [29]. Kosorok M (2008). Introduction to Empirical Processes and Semiparametric Inference. Springer, New York.
- [30]. Leeb H and Pötscher B (2008). Can one estimate the unconditional distribution of post-model-selection estimators? Econometric Theory 24 338–376.
- [31]. Leeb H and Pötscher B (2008). Recent developments in model selection and related areas. Econometric Theory 24 319–322.
- [32]. Leeb H and Pötscher B (2008). Sparse estimators and the oracle property, or the return of Hodges’ estimator. J. Econometrics 142 201–211.
- [33]. Linton O (1996). Edgeworth approximation for MINPIN estimators in semiparametric regression models. Econometric Theory 12 30–60. Cowles Foundation Discussion Papers 1086 (1994).
- [34]. Mammen E (1993). Bootstrap and wild bootstrap for high dimensional linear models. Ann. Statist. 21 255–285.
- [35]. Newey W (1990). Semiparametric efficiency bounds. J. Appl. Econometrics 5 99–135.
- [36]. Newey W (1994). The asymptotic variance of semiparametric estimators. Econometrica 62 1349–1382.
- [37]. Neyman J (1959). Optimal asymptotic tests of composite statistical hypotheses. In Probability and Statistics: The Harald Cramér Volume (Grenander U, ed.) 213–234. Almqvist & Wiksell, Stockholm.
- [38]. Neyman J (1979). C(α) tests and their use. Sankhyā 41 1–21.
- [39]. Ning Y and Liu H (2014). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. Available at arXiv:1412.8765.
- [40]. Pötscher B and Leeb H (2009). On the distribution of penalized maximum likelihood estimators: The LASSO, SCAD, and thresholding. J. Multivariate Anal. 100 2065–2082.
- [41]. Robins J and Rotnitzky A (1995). Semiparametric efficiency in multivariate regression models with missing data. J. Amer. Statist. Assoc. 90 122–129.
- [42]. Stein C (1956). Efficient nonparametric testing and estimation. In Proc. 3rd Berkeley Symp. Math. Statist. and Probab. 1 187–195. Univ. California Press, Berkeley, CA.
- [43]. Van de Geer S, Bühlmann P, Ritov Y and Dezeure R (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. 42 1166–1202.
- [44]. Van der Vaart A (1998). Asymptotic Statistics. Cambridge Univ. Press, Cambridge.
- [45]. Van der Vaart A and Wellner J (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York.
- [46]. Zhang C-H and Zhang S (2014). Confidence intervals for low-dimensional parameters with high-dimensional data. J. Roy. Statist. Soc. Ser. B 76 217–242.
- [47]. Zhao T, Kolar M and Liu H (2014). A general framework for robust testing and confidence regions in high-dimensional quantile regression. Available at arXiv:1412.8724.