Correcting an estimator of a multivariate monotone function with isotonic regression

Ted Westling; Mark J van der Laan; Marco Carone

doi:10.1214/20-ejs1740

Author manuscript; available in PMC: 2021 May 11.

Published in final edited form as: Electron J Stat. 2020 Aug 15;14(2):3032–3069. doi: 10.1214/20-ejs1740

PMCID: PMC8112587 NIHMSID: NIHMS1624095 PMID: 33981382

Abstract

In many problems, a sensible estimator of a possibly multivariate monotone function may fail to be monotone. We study the correction of such an estimator obtained via projection onto the space of functions monotone over a finite grid in the domain. We demonstrate that this corrected estimator has no worse supremal estimation error than the initial estimator, and that analogously corrected confidence bands contain the true function whenever the initial bands do, at no loss to band width. Additionally, we demonstrate that the corrected estimator is asymptotically equivalent to the initial estimator if the initial estimator satisfies a stochastic equicontinuity condition and the true function is Lipschitz and strictly monotone. We provide simple sufficient conditions in the special case that the initial estimator is asymptotically linear, and illustrate the use of these results for estimation of a G-computed distribution function. Our stochastic equicontinuity condition is weaker than standard uniform stochastic equicontinuity, which has been required for alternative correction procedures. This allows us to apply our results to the bivariate correction of the local linear estimator of a conditional distribution function known to be monotone in its conditioning argument. Our experiments suggest that the projection step can yield significant practical improvements.

MSC2020 subject classifications: Primary 62G20, secondary 60G15

Keywords: Asymptotic linearity, confidence band, kernel smoothing, projection, shape constraint, stochastic equicontinuity

1. Introduction

1.1. Background

In many scientific problems, the parameter of interest is a component-wise monotone function. In practice, an estimator of this function may have several desirable statistical properties, yet fail to be monotone. This often occurs when the estimator is obtained through the pointwise application of a statistical procedure over the domain of the function. For instance, we may be interested in estimating a conditional cumulative distribution function θ₀, defined pointwise as θ₀(a, y) = P₀(Y ≤ y ∣ A = a), over its domain $D \subset R^{2}$ . Here, Y may represent an outcome and A an exposure. The map y ↦ θ₀(a, y) is necessarily monotone for each fixed a. In some scientific contexts, it may be known that a ↦ θ₀(a, y) is also monotone for each y, in which case θ₀ is a bivariate component-wise monotone function. An estimator of θ₀ can be constructed by estimating the regression function (a, y) ↦ E_P₀ [I(Y ≤ y) ∣ A = a] for each (a, y) on a finite grid using kernel smoothing, and performing suitable interpolation elsewhere. For some types of kernel smoothing, including the Nadaraya-Watson estimator, the resulting estimator is necessarily monotone as a function of y for each value of a, but not necessarily monotone as a function of a for each value of y. For other types of kernel smoothing, including the local linear estimator, which often has smaller asymptotic bias than the Nadaraya-Watson estimator, the resulting estimator need not be monotone in either component.

Whenever the function of interest is component-wise monotone, failure of an estimator to itself be monotone can be problematic. This is most apparent if the monotonicity constraint is probabilistic in nature – that is, the parameter mapping is monotone under all possible probability distributions. This is the case, for instance, if θ₀ is a distribution function. In such settings, returning a function estimate that fails to be monotone is nonsensical, like reporting a probability estimate outside the interval [0, 1]. However, even if the monotonicity constraint is based on scientific knowledge rather than probabilistic constraints, failure of an estimator to be monotone can be an issue. For example, if the parameter of interest represents average height or weight among children as a function of age, scientific collaborators would likely be unsatisfied if presented with an estimated curve that were not monotone. Finally, as we will see, there are often finite-sample performance benefits to ensuring that the monotonicity constraint is respected.

Whenever this phenomenon occurs, it is natural to seek an estimator that respects the monotonicity constraint but nevertheless remains close to the initial estimator, which may otherwise have good statistical properties. A monotone estimator can be naturally constructed by projecting the initial estimator onto the space of monotone functions with respect to some norm. A common choice is the L₂-norm, which amounts to using multivariate isotonic regression to correct the initial estimator.

1.2. Contribution and organization of the article

In this article, we discuss correcting an initial estimator of a multivariate monotone function by computing the isotonic regression of the estimator over a finite grid in the domain, and interpolating between grid points. We also consider correcting an initial confidence band by using the same procedure applied to the upper and lower limits of the band. We provide three general results regarding this simple procedure.

1. Building on the results of Robertson, Wright and Dykstra (1988) and Chernozhukov, Fernández-Val and Galichon (2009), we demonstrate that the corrected estimator is at least as good as the initial estimator, meaning:
   (a) its uniform error over the grid used in defining the projection is less than or equal to that of the initial estimator for every sample;
   (b) its uniform error over the entire domain is less than or equal to that of the initial estimator asymptotically;
   (c) the corrected confidence band contains the true function on the projection grid whenever the initial band does, at no cost in terms of average or uniform band width.
2. We provide high-level sufficient conditions under which the uniform difference between the initial and corrected estimators is $o_{P} (r_{n}^{- 1})$ for a generic sequence r_n → ∞.
3. We provide simpler lower-level sufficient conditions in two special cases:
   (a) when the initial estimator is uniformly asymptotically linear, in which case the appropriate rate is r_n = n^{1/2};
   (b) when the initial estimator is kernel-smoothed with bandwidth h_n, in which case the appropriate rate is r_n = (nh_n)^{1/2} for univariate kernel smoothing.

We apply our theoretical results to two sets of examples: nonparametric efficient estimation of a G-computed distribution function for a binary exposure, and local linear estimation of a conditional distribution function with a continuous exposure.

Other authors have considered the correction of an initial estimator using isotonic regression. To name a few, Mukarjee and Stern (1994) used a projection-like procedure applied to a kernel smoothing estimator of a regression function, whereas Patra and Sen (2016) used the projection procedure applied to a univariate cumulative distribution function in the context of a mixture model. These articles addressed the properties of the projection procedure in their specific applications. In contrast, we provide general results that are applicable broadly.

1.3. Alternative projection procedures

The projection approach is not the only possible correction procedure. Dette, Neumeyer and Pilz (2006), Chernozhukov, Fernández-Val and Galichon (2009), and Chernozhukov, Fernández-Val and Galichon (2010) studied a correction based on monotone rearrangements. However, monotone rearrangements do not generalize to the multivariate setting as naturally as projections — for example, Chernozhukov, Fernández-Val and Galichon (2009) proposed averaging a variety of possible multivariate monotone rearrangements to obtain a final monotone estimator. In contrast, the L₂ projection of an initial estimator onto the space of component-wise monotone functions is uniquely defined, even in the context of multivariate functions.

Daouia and Park (2013) proposed an alternative correction procedure that consists of taking a convex combination of upper and lower monotone envelope functions, and they demonstrated conditions under which their estimator is asymptotically equivalent in supremum norm to the initial estimator. There are several differences between our contributions and those of Daouia and Park (2013). For instance, Daouia and Park (2013) did not study correction of confidence bands, which we consider in Section 2.3, or the important special case of asymptotically linear estimators, which we consider in Section 3.1. Our results in these two sections apply equally well to our correction procedure and to the correction procedure considered by Daouia and Park (2013).

Perhaps the most important theoretical contribution of our work beyond that of existing research is the weaker form of stochastic equicontinuity that we require for establishing asymptotic equivalence of the initial and projected estimators. In contrast, Daouia and Park (2013) explicitly required the usual uniform asymptotic equicontinuity, while application of the Hadamard differentiability results of Chernozhukov, Fernández-Val and Galichon (2010) requires weak convergence to a tight limit, which is stronger than uniform asymptotic equicontinuity. Our weaker condition allows us to use our general results to tackle a broader range of initial estimators, including kernel smoothed estimators, which are typically not uniformly asymptotically equicontinuous at useful rates, but nevertheless can frequently be shown to satisfy our condition. We discuss this in detail in Section 3.2. We illustrate this general contribution in Section 4.2 by studying the bivariate correction of a conditional distribution function estimated using local linear regression, which would not be possible using the stronger asymptotic equicontinuity condition. In numerical studies, we find that the projected estimator and confidence bands can offer substantial finite-sample improvements over the initial estimator and bands in this example.

2. Main results

2.1. Definitions and statistical setup

Let $M$ be a statistical model of probability measures on a measurable space ($X$, $B$). Let $θ : M \to ℓ^{\infty} (T)$ be a parameter of interest on $M$, where $T ≔ [0, 1]^{d}$ and $ℓ^{\infty} (T)$ is the Banach space of bounded functions from $T$ to $R$ equipped with the supremum norm $‖ \cdot ‖_{T}$. We have specified this particular $T$ for simplicity, but the results established here apply to any bounded rectangular domain $T \subset R^{d}$. For each $P \in M$, denote by θ_P the evaluation of θ at P and note that θ_P is a bounded real-valued function on $T$. For any $t \in T$, denote by $θ_{P} (t) \in R$ the evaluation of θ_P at t.

For any vector $t \in R^{d}$ and 1 ≤ j ≤ d, denote by t_j the j^th component of t. Define the partial order ≤ on $R^{d}$ by setting t ≤ t′ if and only if $t_{j} \leq t_{j}^{'}$ for each 1 ≤ j ≤ d. A function $f : R^{d} \to R$ is called (component-wise) monotone non-decreasing if t ≤ t′ implies that f(t) ≤ f(t′). Denote ∥t∥ = max_1≤j≤d ∣t_j∣ for any vector $t \in R^{d}$ . Additionally, denote by $Θ \subset ℓ^{\infty} (T)$ the convex set of bounded monotone non-decreasing functions from $T$ to $R$ . For concreteness, we focus on non-decreasing functions, but all results established here apply equally to non-increasing functions.
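The partial order and the monotonicity requirement can be made concrete on a rectangular lattice. The following is a minimal sketch, not part of the article: the function names `leq` and `is_monotone_on_grid` are hypothetical, and the grid is assumed to be given as one ascending list of coordinates per axis. On such a lattice it is enough to compare each point with its immediate successor along each axis, since any pair t ≤ t′ is linked by a chain of such steps.

```python
from itertools import product


def leq(t, s):
    """Component-wise partial order: t <= s iff t_j <= s_j for every j."""
    return all(tj <= sj for tj, sj in zip(t, s))


def is_monotone_on_grid(f, axes):
    """Return True if f is non-decreasing over the lattice defined by `axes`.

    `axes` is a list of ascending coordinate lists, one per dimension
    (an assumption of this sketch). Only neighbouring lattice points are
    compared; transitivity covers all other ordered pairs.
    """
    d = len(axes)
    shape = [len(a) for a in axes]
    for idx in product(*(range(m) for m in shape)):
        t = tuple(axes[j][idx[j]] for j in range(d))
        for j in range(d):
            if idx[j] + 1 < shape[j]:
                nxt = list(idx)
                nxt[j] += 1
                s = tuple(axes[k][nxt[k]] for k in range(d))
                if f(t) > f(s):
                    return False
    return True
```

For example, on a 3 × 3 grid over [0, 1]², the map t ↦ t₁ + t₂ passes this check while t ↦ t₁ − t₂ does not.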

Let $M_{0} ≔ {P \in M : θ_{P} \in Θ} \subseteq M$ and suppose that $M_{0}$ is nonempty. Generally, this inclusion is strict only if, rather than being implied by the rules of probability, the monotonicity constraint stems at least in part from prior scientific knowledge. Also, define Θ₀ := {θ ∈ Θ : θ = θ_P for some $P \in M$ } ⊆ Θ. We are primarily interested in settings where Θ₀ = Θ, since in this case there is no additional knowledge about θ encoded by $M$ , and in particular there is no danger of yielding a corrected estimator that is compatible with no $P \in M$ .

Suppose that observations X₁, X₂, … , X_n are sampled independently from an unknown distribution $P_{0} \in M_{0}$, and that we wish to estimate θ₀ := θ_P₀ based on these observations. Suppose that, for each $t \in T$, we have access to an estimator θ_n(t) of θ₀(t) based on X₁, X₂, … , X_n. We note that the assumption that the data are independent and identically distributed is not necessary for Theorems 1 and 2 below. For any suitable $f : X \to R$, we define Pf := ∫ f(x) P(dx) and $G_{n} f ≔ n^{1 ∕ 2} \int f (x) (P_{n} - P_{0}) (d x)$, where $P_{n}$ is the empirical distribution based on X₁, X₂, … , X_n.

The central premise of this article is that θ_n(t) may have desirable statistical properties for each t or even uniformly in t, but that θ_n as an element of $ℓ^{\infty} (T)$ may not fall in Θ for any finite n or even with probability tending to one. Our goal is to provide a corrected estimator $θ_{n}^{*}$ that necessarily falls in Θ, and yet retains the statistical properties of θ_n. A natural way to accomplish this is to define $θ_{n}^{*}$ as the closest element of Θ to θ_n in some norm on $T$. Ideally, we would prefer to take $θ_{n}^{*}$ to minimize $‖ θ - θ_{n} ‖_{T}$ over θ ∈ Θ. However, this is not tractable for two reasons. First, optimization over the entirety of $T$ is an infinite-dimensional optimization problem, and is hence frequently computationally intractable. To resolve this issue, for each n, we let $T_{n} = \{t_{1}, t_{2}, \dots, t_{m_{n}}\} \subseteq T$ be a finite rectangular lattice in $T$ over which we will perform the optimization, and define $‖ \cdot ‖_{T_{n}}$ as the supremum norm over $T_{n}$. Second, while it is now computationally feasible to define $θ_{n, \infty}^{*}$ as a minimizer over θ ∈ Θ of the finite-dimensional objective function $‖ θ - θ_{n} ‖_{T_{n}}$, this objective function is challenging to optimize due to its non-differentiability. Instead, we define

θ_{n}^{*} \in \underset{θ \in Θ}{argmin} \sum_{t \in T_{n}} [θ (t) - θ_{n} (t)]^{2} .

(2.1)

The squared-error objective function is smooth in its arguments. In dimension d = 1, $θ_{n}^{*}$ thus defined is simply the isotonic regression of θ_n on the grid $T_{n}$, which has a closed-form representation as the left derivative of the greatest convex minorant of the so-called cumulative sum diagram. Furthermore, since $‖ θ_{n}^{*} - θ_{n} ‖_{T_{n}} \geq ‖ θ_{n, \infty}^{*} - θ_{n} ‖_{T_{n}}$, many of our results also apply to $θ_{n, \infty}^{*}$.

We note that $θ_{n}^{*}$ is only uniquely defined on $T_{n}$. To completely characterize $θ_{n}^{*}$, we must monotonically interpolate function values between elements of $T_{n}$. We will permit any monotonic interpolation that satisfies a weak condition. By the definition of a rectangular lattice, every $t \in T$ can be assigned a hyper-rectangle whose vertices $\{s_{1}, s_{2}, \dots, s_{2^{d}}\}$ are elements of $T_{n}$ and whose interior has empty intersection with $T_{n}$. If multiple such hyper-rectangles exist for t, such as when t lies on the boundary of two or more such hyper-rectangles, one can be assigned arbitrarily. We will assume that, for $t \notin T_{n}$,

θ_{n}^{*} (t) = \sum_{k} λ_{k, n} (t) θ_{n}^{*} (s_{k})

for weights $λ_{1, n} (t), λ_{2, n} (t), \dots, λ_{2^{d}, n} (t) \in [0, 1]$ such that $\sum_{k} λ_{k, n} (t) = 1$. In words, we assume that $θ_{n}^{*} (t)$ is a convex combination of the values of $θ_{n}^{*}$ on the vertices of the hyper-rectangle containing t. A simple interpolation approach consists of setting $θ_{n}^{*} (t) = θ_{n}^{*} (t^{'})$ with t′ the element of $T_{n}$ closest to t, and choosing any such element if there are multiple elements of $T_{n}$ equally close to t. This particular scheme satisfies our requirement, since it places weight one on a single vertex and weight zero on the others.
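To illustrate the convex-combination requirement in dimension d = 1, the sketch below implements two interpolation schemes between grid points. It is only an illustration under stated assumptions: the function names are hypothetical, the grid is an ascending list, and values outside the grid range are clamped to the end values. Linear interpolation uses weights (1 − λ, λ) on the two bracketing grid points; nearest-point interpolation uses weights in {0, 1}.

```python
from bisect import bisect_right


def interpolate_linear(grid, values, t):
    """Linear interpolation between the two grid points bracketing t.

    The result is (1 - lam) * values[k-1] + lam * values[k] with lam in [0, 1],
    i.e. a convex combination of the values at the cell's two vertices.
    """
    if t <= grid[0]:
        return values[0]
    if t >= grid[-1]:
        return values[-1]
    k = bisect_right(grid, t)  # grid[k-1] <= t < grid[k]
    lam = (t - grid[k - 1]) / (grid[k] - grid[k - 1])
    return (1.0 - lam) * values[k - 1] + lam * values[k]


def interpolate_nearest(grid, values, t):
    """Nearest-grid-point interpolation; ties go to the smaller grid point."""
    k = min(range(len(grid)), key=lambda i: abs(grid[i] - t))
    return values[k]
```

Both schemes return a non-decreasing function of t whenever `values` is non-decreasing along the grid.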

Finally, for each n, we let ℓ_n(t) ≤ u_n(t) denote lower and upper endpoints of a confidence band for θ₀(t). We then define $ℓ_{n}^{*}$ and $u_{n}^{*}$ as the corrected versions of ℓ_n and u_n using the same projection and interpolation procedure defined above for obtaining $θ_{n}^{*}$ from $θ_{n}$ .

In dimension d = 1, $θ_{n}^{*} (t)$ , $ℓ_{n}^{*} (t)$ , and $u_{n}^{*} (t)$ can be obtained for $t \in T_{n}$ via the Pool Adjacent Violators Algorithm (Ayer et al., 1955), as implemented in the R command isoreg (R Core Team, 2018). In dimension d = 2, the corrections can be obtained using the algorithm described in Bril et al. (1984), which is implemented in the R command biviso in the package Iso (Turner, 2015). In dimension d ≥ 3, Kyng, Rao and Sachdeva (2015) provides algorithms for computing the isotonic regression based on embedding the points in a directed acyclic graph. Alternatively, general-purpose algorithms for minimization of quadratic criteria over convex cones have been developed and implemented in the R package coneproj and may be used in this case (Meyer, 1999; Liao and Meyer, 2014).
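For readers who want to see the d = 1 computation spelled out, here is a minimal pure-Python sketch of the pool adjacent violators idea. It is not the `isoreg` implementation; the function name `pava` and the optional weights argument are our own choices for illustration.

```python
def pava(y, w=None):
    """Weighted least-squares non-decreasing fit to the sequence y.

    Adjacent blocks whose means violate the ordering are pooled into a single
    block carrying their weighted mean, until no violation remains.
    """
    n = len(y)
    w = [1.0] * n if w is None else list(w)
    blocks = []  # each block: [weighted mean, total weight, number of points]
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Pool backwards while the last two blocks are out of order.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(w1 * m1 + w2 * m2) / wt, wt, c1 + c2])
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit
```

Applied to the values of θ_n on an ordered grid, the returned list plays the role of $θ_{n}^{*}$ on that grid; the same function applied to ℓ_n and u_n gives the corrected band limits.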

2.2. Properties of the projected estimator

The projected estimator $θ_{n}^{*}$ is the isotonic regression of θ_n over the grid $T_{n}$ . Hence, many existing finite-sample results on isotonic regression can be used to deduce properties of $θ_{n}^{*}$ . Theorem 1 below collects a few of these properties, building upon the results of Barlow et al. (1972) and Chernozhukov, Fernández-Val and Galichon (2009). We denote $ω_{n} ≔ {sup}_{t \in T} {min}_{s \in T_{n}} ‖ t - s ‖$ as the mesh of $T_{n}$ in $T$ .

Theorem 1. (i) It holds that $‖ θ_{n}^{*} - θ_{0} ‖_{T_{n}} \leq ‖ θ_{n} - θ_{0} ‖_{T_{n}}$ .

(ii) If ω_n = o_P(1) and θ₀ is continuous on $T$ , then

‖ θ_{n}^{*} - θ_{0} ‖_{T} \leq ‖ θ_{n} - θ_{0} ‖_{T_{n}} + o_{P} (1) .

(iii) If there exists some α > 0 for which ${sup}_{s, t \in T : ‖ t - s ‖ \leq δ} ∣ θ_{0} (t) - θ_{0} (s) ∣ = o (δ^{α})$ as δ → 0, then

‖ θ_{n}^{*} - θ_{0} ‖_{T} \leq ‖ θ_{n} - θ_{0} ‖_{T_{n}} + o_{P} (ω_{n}^{α}) .

(iv) If θ₀(t) ∈ [ℓ_n(t), u_n(t)] for all $t \in T_{n}$ , then $θ_{0} (t) \in [ℓ_{n}^{*} (t), u_{n}^{*} (t)]$ for all $t \in T_{n}$ .

(v) It holds that $‖ u_{n}^{*} - ℓ_{n}^{*} ‖_{T_{n}} \leq ‖ u_{n} - ℓ_{n} ‖_{T_{n}}$ and

\sum_{t \in T_{n}} [u_{n}^{*} (t) - ℓ_{n}^{*} (t)] = \sum_{t \in T_{n}} [u_{n} (t) - ℓ_{n} (t)] .

Theorem 1 is proved in Appendix A.1. We remark briefly on the implications of Theorem 1. Part (i) says that the estimation error of $θ_{n}^{*}$ over the grid $T_{n}$ is never worse than that of θ_n, whereas parts (ii) and (iii) provide bounds on the estimation error of $θ_{n}^{*}$ on all of $T$ in supremum norm. In particular, part (ii) indicates that $θ_{n}^{*}$ is uniformly consistent on $T$ as long as θ_n is uniformly consistent on $T$ , θ₀ is continuous on $T$ , and ω_n = o_P(1). Part (iii) provides an upper bound on the uniform rate of convergence of $θ_{n}^{*}$ − θ₀, and indicates that if θ₀ is known to lie in a Hölder class, then ω_n can be chosen in such a way as to guarantee that the estimation error of $θ_{n}^{*}$ on all of $T$ is asymptotically no worse than the estimation error of θ_n on $T_{n}$ in supremum norm. We note that parts (i)–(iii) also hold for the L_p norm with respect to uniform measure on $T$ for any p ∈ [1, ∞). Part (iv) guarantees that the isotonized band [ $ℓ_{n}^{*}$ , $u_{n}^{*}$ ] never has worse coverage than the original band over $T_{n}$ . Finally, part (v) states that the potential increase in coverage comes at no cost to the average or supremum width of the bands over $T_{n}$ . We note that parts (i), (iv) and (v) hold true for each n.
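Parts (i), (iv) and (v) are finite-sample statements, so they can be checked numerically on any single data set. The sketch below does so in dimension d = 1 on simulated values. Everything about the setup is an assumption made for illustration: the true function t², the noise level, the band half-widths and the helper names are all hypothetical.

```python
import random


def pava(y):
    """Unweighted isotonic regression of y by pooling adjacent violators."""
    blocks = []  # each block: [sum of values, number of points]
    for v in y:
        blocks.append([float(v), 1])
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return [s / c for s, c in blocks for _ in range(c)]


def sup_dist(a, b):
    """Supremum distance between two functions tabulated on the same grid."""
    return max(abs(x - y) for x, y in zip(a, b))


random.seed(0)
grid = [k / 50 for k in range(51)]
theta0 = [t ** 2 for t in grid]                       # hypothetical monotone truth
theta_n = [v + random.gauss(0.0, 0.05) for v in theta0]  # noisy initial estimator
half = [0.10 + 0.05 * random.random() for _ in grid]  # hypothetical half-widths
lower = [a - h for a, h in zip(theta_n, half)]
upper = [a + h for a, h in zip(theta_n, half)]

theta_star = pava(theta_n)
lower_star = pava(lower)
upper_star = pava(upper)

# Part (i): the grid sup-norm error does not increase.
err_initial = sup_dist(theta_n, theta0)
err_corrected = sup_dist(theta_star, theta0)

# Part (v): sup width does not increase, and total width is unchanged.
width = [u - l for l, u in zip(lower, upper)]
width_star = [u - l for l, u in zip(lower_star, upper_star)]
```

Comparing `err_corrected` with `err_initial`, and `width_star` with `width`, gives a direct check of parts (i) and (v) for this sample.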

While comprehensive in scope, Theorem 1 does not rule out the possibility that $θ_{n}^{*}$ performs strictly better, even asymptotically, than θ_n, or that the band [ $ℓ_{n}^{*}$ , $u_{n}^{*}$ ] is asymptotically strictly more conservative than [ℓ_n, u_n]. In order to construct confidence intervals or bands with correct asymptotic coverage, a stronger result is needed: it must be that $‖ θ_{n}^{*} - θ_{n} ‖_{T} = o_{P} (r_{n}^{- 1})$, where r_n is a diverging sequence such that $r_{n} ‖ θ_{n} - θ_{0} ‖_{T}$ converges in distribution to a non-degenerate limit distribution. Then, we would have that $r_{n} ‖ θ_{n}^{*} - θ_{0} ‖_{T}$ converges in distribution to this same limit, and hence confidence bands constructed using approximations of this limit distribution would have correct coverage when centered around $θ_{n}^{*}$, as we discuss more below.
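The step from uniform closeness of the two estimators to a shared limit distribution can be spelled out with the reverse triangle inequality. As a sketch, assuming the uniform difference between the corrected and initial estimators is of smaller order than the inverse rate:

```latex
\left| \, r_n \| \theta_n^* - \theta_0 \|_T - r_n \| \theta_n - \theta_0 \|_T \, \right|
  \;\leq\; r_n \| \theta_n^* - \theta_n \|_T \;=\; o_P(1).
```

By Slutsky's lemma, the two scaled uniform errors then have the same limit in distribution.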

We consider the following conditions on θ₀ and the initial estimator θ_n:

(A) there exists a deterministic sequence r_n tending to infinity such that, for all δ > 0,
$sup_{‖ t - s ‖ < δ ∕ r_{n}} ∣ r_{n} [θ_{n} (t) - θ_{0} (t)] - r_{n} [θ_{n} (s) - θ_{0} (s)] ∣ = o_{P} (1);$
(B) there exists K₁ < ∞ such that ∣θ₀(t) − θ₀(s)∣ ≤ K₁∥t − s∥ for all t, $s \in T$;
(C) there exists K₀ > 0 such that K₀∥t − s∥ ≤ ∣θ₀(t) − θ₀(s)∣ for all t, $s \in T$.

Based on these conditions, we have the following result.

Theorem 2. If (A)–(C) hold and $ω_{n} = o_{P} (r_{n}^{- 1})$, then $‖ θ_{n}^{*} - θ_{n} ‖_{T} = o_{P} (r_{n}^{- 1})$.

The proof of Theorem 2 is presented in Appendix A.2. This result indicates that the projected estimator is uniformly asymptotically equivalent to the original estimator in supremum norm at the rate r_n.

Condition (A) is related to, but notably weaker than, uniform stochastic equicontinuity (van der Vaart and Wellner, 1996, p. 37). (A) follows if, in particular, the process $\{r_{n} [θ_{n} (t) - θ_{0} (t)] : t \in T\}$ converges weakly to a tight limit in the space $ℓ^{\infty} (T)$. However, the latter condition is sufficient but not necessary for (A) to hold. This is important for application of our results to kernel smoothing estimators, which typically do not converge weakly to a tight limit, but for which condition (A) nevertheless often holds. We discuss this at length in Section 3.2. The results of Daouia and Park (2013) (see in particular condition (C3) therein) and Chernozhukov, Fernández-Val and Galichon (2010) rely on uniform stochastic equicontinuity in demonstrating asymptotic equivalence of their correction procedures, which essentially limits the applicability of their procedures to estimators that converge weakly to a tight limit in $ℓ^{\infty} (T)$.

Condition (B) constrains θ₀ to be Lipschitz. Condition (C) constrains the variation of θ₀ from below, and is slightly more restrictive than a requirement for strict monotonicity. If, for instance, θ₀ is differentiable, then (C) is satisfied if all first-order partial derivatives of θ₀ are bounded away from zero. Condition (C) excludes, for instance, situations in which θ₀ is differentiable with null derivative over an interval. In such cases, $θ_{n}^{*}$ may have strictly smaller variance on these intervals than θ_n because $θ_{n}^{*}$ will pool estimates across the flat region while θ_n may not. Hence, in such cases, $θ_{n}^{*}$ may potentially asymptotically improve on θ_n, so that $θ_{n}^{*}$ and θ_n are not asymptotically equivalent at the rate r_n. Theoretical results in these cases would be of interest, but are beyond the scope of this article.

In addition to conditions (A)–(C), Theorem 2 requires that the mesh ω_n of $T_{n}$ tend to zero in probability faster than $r_{n}^{- 1}$. Since $T_{n}$ is chosen by the user, as long as r_n (or an upper bound thereof) is known, this is not a problem in practice. Furthermore, except in irregular problems, the rate of convergence is typically not faster than n^{−1/2}, and hence it is typically sufficient to set ω_n = c_n n^{−1/2} for some c_n = o(1). We note, however, that the computational complexity of obtaining the isotonic regression of θ_n over $T_{n}$ increases as ω_n decreases. Hence, in cases where the rate of convergence of the initial estimator is strictly slower than n^{−1/2}, it may be preferable to choose ω_n more carefully based on a precise determination of r_n. We expect this to be especially true in the context of large d and n.
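The mesh requirement translates directly into a grid size. The sketch below assumes a regular grid with m equally spaced points per axis on [0, 1], endpoints included, for which the mesh in the max-norm is half the spacing, 1/(2(m − 1)). The function names and the default choice c_n = 1/log n are hypothetical; any sequence tending to zero would serve.

```python
import math


def mesh_of_regular_grid(m):
    """Mesh (max-norm) of the grid {0, 1/(m-1), ..., 1}^d in [0, 1]^d.

    The point farthest from the grid sits at the centre of a cell, at
    distance half the spacing from the nearest grid point.
    """
    return 1.0 / (2 * (m - 1))


def points_per_axis(n, r_n=None, c_n=None):
    """Smallest m whose regular grid has mesh at most c_n / r_n.

    Defaults (assumptions of this sketch): r_n = sqrt(n), c_n = 1 / log(n).
    """
    r_n = math.sqrt(n) if r_n is None else r_n
    c_n = 1.0 / math.log(n) if c_n is None else c_n
    target = c_n / r_n
    return max(2, math.ceil(1.0 / (2.0 * target)) + 1)
```

The full grid then has m^d points, which makes the computational cost mentioned above explicit: halving the mesh multiplies the number of grid points by roughly 2^d.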

We note that conditions (A)–(C) and $ω_{n} = o_{P} (r_{n}^{- 1})$ also imply that $‖ θ_{n}^{*} - θ_{n} ‖_{L_{p} (T)} = o_{P} (r_{n}^{- 1})$ for any p ∈ [1, ∞), where $L_{p} (T)$ is the L_p norm on $T$ with respect to uniform measure on $T$. However, it may be possible to relax conditions (A)–(C) for the purpose of demonstrating L_p asymptotic equivalence of $θ_{n}^{*}$ and θ_n for p < ∞. It is not clear whether our method of proof of Theorem 2 is amenable to such weakening. We have chosen to focus on uniform asymptotic equivalence in part for its use in constructing uniform confidence bands for θ₀, as we discuss in the next section.

2.3. Construction of confidence bands

Suppose there exists a fixed function $γ_{α} : T \to R$ such that ℓ_n and u_n satisfy:

(a) $‖ r_{n} (θ_{n} - ℓ_{n}) - γ_{α} ‖_{T} \to_{P} 0$;
(b) $‖ r_{n} (u_{n} - θ_{n}) - γ_{α} ‖_{T} \to_{P} 0$;
(c) P₀ [r_n∣θ_n(t) − θ₀(t)∣ ≤ γ_α(t) for all $t \in T$] → 1 − α.

As an example of a confidence band that satisfies conditions (a)–(c), suppose that $σ_{0} : T \to (0, + \infty)$ is a scaling function and c_α is a fixed constant such that, as n tends to infinity,

P_{0} (r_{n} {‖ \frac{θ_{n} - θ_{0}}{σ_{0}} ‖}_{T} \leq c_{α}) \to 1 - α .

If σ_n is an estimator of σ₀ satisfying $‖ σ_{n} - σ_{0} ‖_{T} \to_{P} 0$ and c_α,n is an estimator of c_α such that c_α,n →_P c_α, then the Wald-type band defined by lower and upper endpoints $ℓ_{n} (t) ≔ θ_{n} (t) - c_{α, n} r_{n}^{- 1} σ_{n} (t)$ and $u_{n} (t) ≔ θ_{n} (t) + c_{α, n} r_{n}^{- 1} σ_{n} (t)$ satisfies (a)–(c) with γ_α = c_ασ₀. However, the latter conditions can also be satisfied by other types of bands, such as those constructed with a consistent bootstrap procedure.
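As a small illustration of the Wald-type construction, the sketch below computes the band endpoints on a grid from tabulated values of the estimator and of the scale estimate. The function name `wald_band` and its argument names are hypothetical; the critical value and the rate are treated as given inputs rather than estimated here.

```python
def wald_band(theta_n, sigma_n, c_alpha_n, r_n):
    """Wald-type band endpoints theta_n(t) -/+ c_alpha_n * sigma_n(t) / r_n.

    `theta_n` and `sigma_n` are lists of values on a common grid;
    `c_alpha_n` is an estimated critical value and `r_n` the rate,
    both supplied by the caller (assumptions of this sketch).
    """
    half = [c_alpha_n * s / r_n for s in sigma_n]
    lower = [th - h for th, h in zip(theta_n, half)]
    upper = [th + h for th, h in zip(theta_n, half)]
    return lower, upper
```

The returned lists can then be passed through the same projection and interpolation procedure as the estimator itself to obtain the corrected band.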

Under conditions (a)–(c), the confidence band [ℓ_n, u_n] has asymptotic coverage 1 − α. When conditions (A) and (B) also hold, the corrected band [ $ℓ_{n}^{*}$ , $u_{n}^{*}$ ] has the same asymptotic coverage as the original band [ℓ_n, u_n], as stated in the following result.

Corollary 1. If (A)–(B) and (a)–(c) hold, γ_α is uniformly continuous on $T$, and $ω_{n} = o_{P} (r_{n}^{- 1})$, then the band [ $ℓ_{n}^{*}$ , $u_{n}^{*}$ ] has asymptotic coverage 1 − α.

The proof of Corollary 1 is presented in Appendix A.3. We also note that Theorem 2 implies that Wald-type confidence bands constructed around θ_n have the same asymptotic coverage if they are constructed around $θ_{n}^{*}$ instead.

3. Refined results under additional structure

In this section, we provide more detailed conditions that imply condition (A) in two special cases: when θ_n is asymptotically linear, and when θ_n is a kernel smoothing-type estimator.

3.1. Special case I: asymptotically linear estimators

Suppose that the initial estimator θ_n is uniformly asymptotically linear (UAL): for each $t \in T$, there exists $ϕ_{0, t} : X \to R$ depending on P₀ such that ∫ ϕ_0,t dP₀ = 0, $\int ϕ_{0, t}^{2} d P_{0} < \infty$ and

θ_{n} (t) = θ_{0} (t) + \frac{1}{n} \sum_{i = 1}^{n} ϕ_{0, t} (X_{i}) + R_{n, t}

(3.1)

for a remainder term R_n,t with $n^{1 ∕ 2} {sup}_{t \in T} ∣ R_{n, t} ∣ = o_{P} (1)$. The function ϕ_0,t is the influence function of θ_n(t) under sampling from P₀. It is desirable for θ_n to have representation (3.1) because this implies its uniform weak consistency as well as the pointwise asymptotic normality of n^{1/2} [θ_n(t) − θ₀(t)] for each $t \in T$. If in addition the collection $\{ϕ_{0, t} : t \in T\}$ of influence functions forms a P₀-Donsker class, then $\{n^{1 ∕ 2} [θ_{n} (t) - θ_{0} (t)] : t \in T\}$ converges weakly in $ℓ^{\infty} (T)$ to a Gaussian process with covariance function Σ₀ : (t, s) ↦ ∫ ϕ_0,t(x)ϕ_0,s(x)dP₀(x). Uniform asymptotic confidence bands based on θ_n can then be formed by using appropriate quantiles from any suitable approximation of the distribution of the supremum of the limiting Gaussian process.

We introduce two additional conditions:

(A1) the collection { $ϕ_{0, t} : t \in T$ } of influence curves is a P₀-Donsker class;

(A2) Σ₀ is uniformly continuous in the sense that

\underset{‖ t - s ‖ \to 0}{lim sup} ∣ Σ_{0} (s, t) - Σ_{0} (t, t) ∣ = 0 .

Whenever θ_n is uniformly asymptotically linear, Theorem 2 can be shown to hold under (A1), (A2), (B) and (C), as implied by the theorem below. The validity of (A1) and (A2) can be assessed by scrutinizing the influence function ϕ_0,t of θ_n(t) for each $t \in T$. This fact renders the verification of these conditions very simple once uniform asymptotic linearity has been established.

Theorem 3. For any UAL estimator θ_n, (A1)–(A2) together imply (A).

The proof of Theorem 3 is provided in Appendix A.4. In Section 4.1, we illustrate the use of Theorem 3 for the estimation of a G-computed distribution function.

We note that conditions (A1)–(A2) are actually sufficient to establish uniform asymptotic equicontinuity, which as discussed above is stronger than (A). Therefore, Theorem 3 can also be used to prove asymptotic equivalence of the majorization/minorization correction procedure studied in Daouia and Park (2013).

3.2. Special case II: kernel smoothed estimators

For certain parameters, asymptotically linear estimators are not available. In particular, this is the case when the parameter of interest is not sufficiently smooth as a mapping of P₀. For example, density functions, regression functions, and conditional quantile functions do not permit asymptotically linear estimators in a nonparametric model when the exposure is continuous. In these settings, a common approach to nonparametric estimation is kernel smoothing.

Recent results suggest that, as a process, the only possible weak limit of $\{r_{n} [θ_{n} (t) - θ_{0} (t)] : t \in T\}$ in $ℓ^{\infty} (T)$ is zero when θ_n is a kernel smoothed estimator. For example, in the case of the Parzen-Rosenblatt density estimator with bandwidth h_n, Theorem 3 of Stupfler (2016) implies that if

c_{n} ≔ r_{n} (n h_{n} ∕ ∣ \log h_{n} ∣)^{- 1 ∕ 2} \to 0,

then { $r_{n} [θ_{n} (t) - θ_{0} (t)] : t \in T$ } converges weakly to zero in $ℓ^{\infty} (T)$ , whereas if c_n → c ∈ (0, ∞], then it does not converge weakly to a tight limit in $ℓ^{\infty} (T)$ . As a result, { $r_{n} [θ_{n} (t) - θ_{0} (t)] : t \in T$ } only satisfies uniform stochastic equicontinuity for r_n such that c_n → 0. However, for any such rate r_n, $r_{n}^{- 1}$ is slower than the pointwise and uniform rates of convergence of θ_n − θ₀. As a result, θ_n and $θ_{n}^{*}$ may not be asymptotically equivalent at the uniform rate of convergence of θ_n − θ₀, so that confidence intervals and regions based on the limit distribution of θ_n − θ₀, but centered around $θ_{n}^{*}$ , may not have correct coverage. We note that, while Stupfler (2016) establishes formal results for the Parzen-Rosenblatt estimator, we expect that the results therein extend to a variety of kernel smoothed estimators.
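As a quick check of the last point, take the rate associated with univariate kernel smoothing in Section 1.2, r_n = (nh_n)^{1/2}, and assume h_n → 0. Substituting into the definition of c_n gives the following.

```latex
c_n \;=\; (n h_n)^{1/2} \left( \frac{n h_n}{|\log h_n|} \right)^{-1/2}
    \;=\; |\log h_n|^{1/2} \;\longrightarrow\; \infty
    \qquad \text{as } h_n \to 0 .
```

So the requirement c_n → 0 fails at this rate: it holds only for r_n of strictly smaller order than (nh_n ∕ ∣log h_n∣)^{1/2}, which is itself slower than (nh_n)^{1/2} by a logarithmic factor.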

As a result of the lack of uniform stochastic equicontinuity of r_n(θ_n − θ₀) for useful rates r_n, establishing (A) is much more difficult for kernel smoothed estimators than for asymptotically linear estimators. However, since (A) is weaker than uniform stochastic equicontinuity, it may still be possible. Here, we provide alternative sufficient conditions that imply condition (A) and that we have found useful for studying a kernel smoothed estimator θ_n.

When the initial estimator θ_n is kernel smoothed, we can often show that

sup_{t \in T} ∣ r_{n} [θ_{n} (t) - θ_{0} (t)] - a_{n} b_{0} (t) - R_{n} (t) ∣ \overset{P}{\to} 0,

(3.2)

where $b_{0} : T \to R$ is a deterministic bias, a_n is a sequence of positive constants, and $R_{n} : T \to R$ is a random remainder term. We then have that

sup_{‖ t - s ‖ < δ ∕ r_{n}} ∣ r_{n} [θ_{n} (t) - θ_{0} (t)] - r_{n} [θ_{n} (s) - θ_{0} (s)] ∣ \leq sup_{‖ t - s ‖ < δ ∕ r_{n}} a_{n} ∣ b_{0} (t) - b_{0} (s) ∣ + sup_{‖ t - s ‖ < δ ∕ r_{n}} ∣ R_{n} (t) - R_{n} (s) ∣ + o_{P} (1) .
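To see when the bias term in this display is negligible, suppose that b₀ is α-Hölder on $T$ with constant C, so that ∣b₀(t) − b₀(s)∣ ≤ C∥t − s∥^α (the constant C is our notation for this sketch). Then

```latex
\sup_{\| t - s \| < \delta / r_n} a_n \, | b_0(t) - b_0(s) |
  \;\leq\; a_n \, C \left( \frac{\delta}{r_n} \right)^{\alpha}
  \;=\; C \, \delta^{\alpha} \, a_n \, r_n^{-\alpha} ,
```

which tends to zero for every fixed δ > 0 as soon as a_n r_n^{−α} → 0.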

If b₀ is uniformly continuous on $T$ and a_n = O(1), or b₀ is uniformly α-Hölder on $T$ and $a_{n} = O (r_{n}^{α})$ , then the first term on the right-hand side tends to zero in probability. Attention may then be turned to demonstrating that the second term vanishes in probability. It appears difficult to provide a general characterization of the form of R_n that encompasses kernel smoothed estimators. However, in our experience, it is frequently the case that R_n(t) involves terms of the form $G_{n} ν_{n, t}$ , where $ν_{n, t} : X \to R$ is a deterministic function for each n ∈ {1, 2, …} and $t \in T$ . In the course of demonstrating that

sup_{‖ t - s ‖ < δ ∕ r_{n}} ∣ R_{n} (t) - R_{n} (s) ∣ \overset{P}{\to} 0,

a rate of convergence for

sup_{‖ t - s ‖ < δ ∕ r_{n}} ∣ G_{n} (ν_{n, t} - ν_{n, s}) ∣

is then required. Defining $F_{n, η} ≔ {ν_{n, t} - ν_{n, s} : ‖ t - s ‖ < η}$ for each η > 0, this is equivalent to establishing a rate of convergence for the local empirical process $‖ G_{n} ‖_{F_{n, δ ∕ r_{n}}} ≔ {sup}_{ξ \in F_{n, δ ∕ r_{n}}} ∣ G_{n} ξ ∣$ . Such rates can be established using tail bounds for empirical processes. We briefly comment on two approaches to obtaining such tail bounds.

We first define bracketing and covering numbers of a class of functions $F$ — see van der Vaart and Wellner (1996) for a comprehensive treatment. We denote by ∥F∥_P,2 = [P(F²)]^1/2 the L₂(P) norm of a given P-square-integrable function $F : X \to R$ . The bracketing number $N_{[]} (ε, F, L_{2} (P))$ of a class of functions $F$ with respect to the L₂(P) norm is the smallest number of ε-brackets needed to cover $F$ , where an ε-bracket is any set of functions {f : ℓ ≤ f ≤ u} with ℓ and u such that ∥ℓ − u∥_P,2 < ε. The covering number $N (ε, F, L_{2} (Q))$ of $F$ with respect to the L₂(Q) norm is the smallest number of ε-balls in L₂(Q) required to cover $F$ . The uniform covering number is the supremum of $N (ε ‖ F ‖_{Q, 2}, F, L_{2} (Q))$ over all discrete probability measures Q such that ∥F∥_Q,2 > 0, where F is an envelope function for $F$ . The bracketing and uniform entropy integrals for $F$ with respect to F are then defined as

J_{[]} (δ, F) ≔ \int_{0}^{δ} [1 + \log N_{[]} (ε ‖ F ‖_{P_{0}, 2}, F, L_{2} (P_{0}))]^{1 ∕ 2} d ε,

J (δ, F) ≔ sup_{Q} \int_{0}^{δ} [1 + \log N (ε ‖ F ‖_{Q, 2}, F, L_{2} (Q))]^{1 ∕ 2} d ε .

We discuss two approaches to controlling $‖ G_{n} ‖_{F_{n, δ ∕ r_{n}}}$ using these integrals. Suppose that $F_{n, η}$ has envelope function F_n,η in the sense that ∣ξ(x)∣ ≤ F_n,η(x) for all $ξ \in F_{n, η}$ and $x \in X$ . The first approach is useful when ∥F_{n,δ/r_n}∥_P₀,2 can be adequately controlled. Specifically, if either $J (1, F_{n, δ ∕ r_{n}})$ or $J_{[]} (1, F_{n, δ ∕ r_{n}})$ is O(1), then $‖ G_{n} ‖_{F_{n, δ ∕ r_{n}}} \leq M_{δ} ‖ F_{n, δ ∕ r_{n}} ‖_{P_{0}, 2}$ for all n and some constant M_δ ∈ (0, ∞) not depending on n by Theorems 2.14.1 and 2.14.2 of van der Vaart and Wellner (1996).

The second approach we consider is useful when the envelope functions do not shrink in expectation, but the functions in $F_{n, η}$ still get smaller in the sense that $γ_{n, δ} ≔ {sup}_{ξ \in F_{n, δ ∕ r_{n}}} ‖ ξ ‖_{P_{0}, 2}$ tends to zero. For example, if ν_n,t is defined as ν_n,t(x) := I(0 ≤ x ≤ t) for each $x \in X \subseteq R$ , t ∈ [0, 1], and n, then F_n,η : x ↦ I(0 ≤ x ≤ 1) is the natural envelope function for $F_{n, η}$ for all n and η, so that ∥F_{n,δ/r_n}∥_P₀,2 does not tend to zero. However, if the density p₀ corresponding to P₀ is bounded above by ${\bar{p}}_{0}$ , then $γ_{n, δ}^{2} \leq {\bar{p}}_{0} δ ∕ r_{n}$ , which does tend to zero. In these cases, the basic tail bounds in Theorems 2.14.1 and 2.14.2 of van der Vaart and Wellner (1996) are too weak. Sharper, but slightly more complicated, bounds may be used instead. Specifically, if F_{n,δ/r_n} ≤ C < ∞ for all n large enough and either

J (γ_{n, δ}, F_{n, δ ∕ r_{n}}) + \frac{J (γ_{n, δ}, F_{n, δ ∕ r_{n}})^{2}}{γ_{n, δ}^{2} n^{1 ∕ 2}} or J_{[]} (γ_{n, δ}, F_{n, δ ∕ r_{n}}) + \frac{J_{[]} (γ_{n, δ}, F_{n, δ ∕ r_{n}})^{2}}{γ_{n, δ}^{2} n^{1 ∕ 2}}

are $o (z_{n}^{- 1})$ , then $‖ G_{n} ‖_{F_{n, δ ∕ r_{n}}} = o_{P} (z_{n}^{- 1})$ by Lemma 3.4.2 of van der Vaart and Wellner (1996) and Theorem 2.1 of van der Vaart and Wellner (2011). Analogous statements hold if these expressions are $O (z_{n}^{- 1})$ .
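As a worked check of the bound $γ_{n, δ}^{2} \leq {\bar{p}}_{0} δ ∕ r_{n}$ in the indicator example above, take s < t (an ordering assumed here only to simplify notation). The difference ν_n,t − ν_n,s is then the indicator of the interval (s, t], which equals its own square, so that:

```latex
P_{0} (\nu_{n,t} - \nu_{n,s})^{2}
  = P_{0} \{ x : s < x \leq t \}
  = \int_{s}^{t} p_{0}(x) \, dx
  \leq \bar{p}_{0} \, (t - s)
  < \bar{p}_{0} \, \delta / r_{n}
  \qquad \text{for } 0 \leq s < t \leq 1 \text{ with } t - s < \delta / r_{n} .
```

Taking the supremum over all such pairs gives the stated bound on $γ_{n, δ}^{2}$.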

In some cases, both of these approaches must be used to control different terms arising within R_n(t), as for the conditional distribution function discussed in Section 4.2.

4. Illustrative examples

4.1. Example 1: Estimation of a G-computed distribution function

We first demonstrate the use of Theorem 3 in the particular problem in which we wish to draw inference on a G-computed distribution function. Suppose that the data unit is the vector X = (Y, A, W), where Y is an outcome, A ∈ {0, 1} is an exposure, and W is a vector of baseline covariates. The observed data consist of independent draws X₁, X₂, … , X_n from $P_{0} \in M$ , where $M$ is a nonparametric model.

For $P \in M$ and a₀ ∈ {0, 1}, we define the parameter value θ_P,a₀ pointwise as θ_P,a₀(t) := E_P {P (Y ≤ t ∣ A = a₀, W)}, the G-computed distribution function of Y evaluated at t, where the outer expectation is over the marginal distribution of W under P. We are interested in estimating θ_0,a₀ := θ_P₀,a₀. This parameter is often of interest as an interpretable marginal summary of the relationship between Y and A accounting for the potential confounding induced by W. Under certain causal identification conditions, θ_0,a₀ is the distribution function of the counterfactual outcome Y(a₀) defined by the intervention that deterministically sets exposure to A = a₀ (Robins, 1986; Gill and Robins, 2001).

For each t, the parameter P ↦ θ_P,a₀ (t) is pathwise differentiable in a nonparametric model, and its nonparametric efficient influence function φ_P,a₀,t at $P \in M$ is given by

(y, a, w) \mapsto \frac{I (a = a_{0})}{g_{P} (a_{0} ∣ w)} [I (y \leq t) - {\bar{Q}}_{P} (t ∣ a_{0}, w)] + {\bar{Q}}_{P} (t ∣ a_{0}, w) - θ_{P, a_{0}} (t),

where g_P(a₀ ∣ w) := P(A = a₀ ∣ W = w) is the propensity score and ${\bar{Q}}_{P} (t ∣ a_{0}, w) ≔ P (Y \leq t ∣ A = a_{0}, W = w)$ is the conditional exposure-specific distribution function, as implied by P (van der Laan and Robins, 2003). Given estimators g_n and ${\bar{Q}}_{n}$ of g₀ := g_P₀ and ${\bar{Q}}_{0} ≔ {\bar{Q}}_{P_{0}}$ , respectively, several approaches can be used to construct, for each t, an asymptotically linear estimator of θ_0,a₀(t) with influence function ϕ_0,a₀,t = φ_P₀,a₀,t. For example, the use of either optimal estimating equations or the one-step correction procedure leads to the doubly-robust augmented inverse-probability-weighted estimator

θ_{n, a_{0}} (t) ≔ \frac{1}{n} \sum_{i = 1}^{n} {\frac{I (A_{i} = a_{0})}{g_{n} (a_{0} ∣ W_{i})} [I (Y_{i} \leq t) - {\bar{Q}}_{n} (t ∣ a_{0}, W_{i})] + {\bar{Q}}_{n} (t ∣ a_{0}, W_{i})},

as discussed in detail in van der Laan and Robins (2003). Under conditions on g_n and ${\bar{Q}}_{n}$ , including consistency at fast enough rates, θ_n,a₀(t) is asymptotically efficient relative to $M$ . In this case, θ_n,a₀(t) satisfies (3.1) with influence function ϕ_0,a₀,t. However, there is no guarantee that θ_n,a₀ is monotone.
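A minimal sketch of how the displayed estimator could be evaluated at a single t, assuming the nuisance estimators are supplied as plain Python callables. The names `aipw_cdf`, `g_n` and `q_n` are illustrative and do not come from the paper, and the toy inputs are made up solely to show that the output need not be monotone in t.

```python
def aipw_cdf(t, data, g_n, q_n, a0):
    """Evaluate theta_{n,a0}(t) from observations (y, a, w).

    g_n(a0, w) stands in for the propensity estimate g_n(a0 | w) and
    q_n(t, a0, w) for the estimate of Q-bar(t | a0, w); both are assumed
    to be user-supplied callables (hypothetical interface).
    """
    total = 0.0
    for y, a, w in data:
        q = q_n(t, a0, w)
        if a == a0:
            # Inverse-probability-weighted residual I(Y <= t) - Q-bar_n.
            total += ((1.0 if y <= t else 0.0) - q) / g_n(a0, w)
        total += q  # plug-in (G-computation) term
    return total / len(data)


# Hypothetical toy input: one observation with A = a0 = 1 and Y = 0.9,
# a constant propensity estimate of 0.5, and q_n(t | a0, w) = t.
toy = [(0.9, 1, None)]
g_toy = lambda a0, w: 0.5
q_toy = lambda t, a0, w: t
low = aipw_cdf(0.2, toy, g_toy, q_toy, a0=1)
high = aipw_cdf(0.4, toy, g_toy, q_toy, a0=1)
```

In this toy input the weight on the plug-in term is 1 − 1/0.5 < 0, so the estimate decreases from t = 0.2 to t = 0.4 even though the supplied q_n is increasing in t.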

In the context of this example, we can identify simple sufficient conditions under which conditions (A)–(B), and hence the asymptotic equivalence of the initial and isotonized estimators of the G-computed distribution function, are guaranteed. Specifically, we find this to be the case when both:

there exists η > 0 such that g₀(a₀ ∣ W) ≥ η almost surely under P₀;
there exist non-negative real-valued functions K₁, K₂ such that

K_{1} (w) ∣ t - s ∣ \leq ∣ {\bar{Q}}_{0} (t ∣ a_{0}, w) - {\bar{Q}}_{0} (s ∣ a_{0}, w) ∣ \leq K_{2} (w) ∣ t - s ∣

for all t, $s \in T$ , and such that, under P₀, K₁(W) is strictly positive with non-zero probability and K₂(W) has finite second moment.

We conducted a simulation study to validate our theoretical results in the context of this particular example. For sample sizes 100, 250, 500, 750, and 1000, we generated 1000 random datasets as follows. We first simulated a bivariate covariate W with independent components W₁ and W₂, respectively distributed as a Bernoulli variate with success probability 0.5 and a uniform variate on (−1, 1). Given W = (w₁, w₂), exposure A was simulated from a logistic regression model with

P_{0} (A = 1 ∣ W_{1} = w_{1}, W_{2} = w_{2}) = expit (0.5 + w_{1} - 2 w_{2}) .

Given W = (w₁, w₂) and A = a, Y was simulated as the inverse-logistic transformation of a normal variate with mean 0.2 − 0.3a − 4w₂ and variance 0.3.
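The data-generating process just described can be sketched as follows. This is an illustration rather than the authors' simulation code: the function names and the seed handling are assumptions, and "inverse-logistic transformation" is read here as the expit map z ↦ 1/(1 + e^(−z)), which places Y in (0, 1).

```python
import math
import random


def expit(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_example1(n, seed=0):
    """Draw n tuples (y, a, w1, w2) from the design described in the text."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        w1 = 1 if rng.random() < 0.5 else 0      # Bernoulli(0.5)
        w2 = rng.uniform(-1.0, 1.0)              # Uniform(-1, 1)
        p_a = expit(0.5 + w1 - 2.0 * w2)         # P0(A = 1 | W)
        a = 1 if rng.random() < p_a else 0
        # Latent normal with mean 0.2 - 0.3a - 4*w2 and variance 0.3.
        z = rng.gauss(0.2 - 0.3 * a - 4.0 * w2, math.sqrt(0.3))
        y = expit(z)  # assumption: expit is the intended transformation
        data.append((y, a, w1, w2))
    return data
```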

For each simulated dataset, we estimated θ_0,0(t) and θ_0,1(t) for t equal to each outcome value observed between 0.1 and 0.9. To do so, we used the estimator described above, with propensity score and conditional exposure-specific distribution function estimated using correctly-specified parametric models. We employed two correction procedures for the estimators θ_n,0 and θ_n,1. First, we projected θ_n,0 and θ_n,1 onto the space of monotone functions separately. Second, noting that θ_0,0(t) ≤ θ_0,1(t) for all t, so that (a, t) ↦ θ_0,a(t) is component-wise monotone for this particular data-generating distribution, we considered the projection of (a, t) ↦ θ_n,a(t) onto the space of bivariate monotone functions on ${0, 1} \times T$ . For each simulation and each projection procedure, we recorded the maximal absolute differences between (i) the initial and projected estimates, (ii) the initial estimate and the truth, and (iii) the projected estimate and the truth. We also recorded the maximal widths of the initial and projected confidence bands.
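The univariate projection step can be sketched with the pool-adjacent-violators algorithm, a standard method for unweighted isotonic regression on an ordered grid. The text does not say which algorithm was used in the simulations, so this choice is an assumption, and the numbers below are made-up toy values rather than simulation output.

```python
def isotonic_regression(values):
    """Least-squares projection of a sequence onto nondecreasing sequences
    (pool-adjacent-violators, equal weights)."""
    blocks = []  # each block stores [sum of pooled values, number pooled]
    for v in values:
        blocks.append([float(v), 1])
        # Pool while the previous block's mean exceeds the last block's mean.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted


def max_abs_diff(u, v):
    """Maximal absolute difference between two sequences on the grid."""
    return max(abs(a - b) for a, b in zip(u, v))


# Toy grid values (hypothetical): a monotone truth, a non-monotone initial
# estimate, and a constant-width band around the initial estimate.
truth = [0.1, 0.2, 0.3, 0.4, 0.5]
initial = [0.1, 0.3, 0.2, 0.5, 0.4]
lower = [x - 0.15 for x in initial]
upper = [x + 0.15 for x in initial]

projected = isotonic_regression(initial)
lower_star = isotonic_regression(lower)
upper_star = isotonic_regression(upper)
```

On this toy input the projected estimate is closer to the truth in maximal absolute error than the initial estimate, and the projected band still contains the truth without being wider, which is the behavior the recorded quantities (i) to (iii) and the band widths are meant to capture.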

Figure 1 displays the results of this simulation study, with output from the univariate and bivariate projection approaches summarized in the top and bottom rows, respectively. The left column displays the empirical distribution of the scaled maximum absolute discrepancy between θ_n and $θ_{n}^{*}$ for all sample sizes studied. This plot confirms that the discrepancy between these two estimators indeed decreases faster than n^−1/2, as our theory suggests. Furthermore, for each n, the discrepancy is larger for the two-dimensional projection.

Fig 1. — Summary of simulation results for G-computed distribution function. Each plot shows cumulative distributions of a particular discrepancy over 1000 simulated datasets for different values of n. Left panel: maximal absolute difference between the initial and isotonic estimators over the grid used for projecting, scaled up by root-n. Middle panel: ratio of the maximal absolute difference between the initial estimator and the truth and the maximal absolute difference between the isotonic estimator and the truth. Right panel: ratio of the maximal width of the initial confidence band and the maximal width of the isotonic confidence band. The top row shows the results for the univariate projection, and the bottom row shows the results for the bivariate projection.

The middle column of Figure 1 displays the empirical distribution function of the ratio between the maximum discrepancy between θ_n and θ₀ and that between $θ_{n}^{*}$ and θ₀. This plot confirms that $θ_{n}^{*}$ is always at least as close to θ₀ as is θ_n over $T_{n}$ . The maximum discrepancy between θ_n and θ₀ can be more than 25% larger than that between $θ_{n}^{*}$ and θ₀ in the univariate case, and up to 50% larger in the bivariate case.

The right column of Figure 1 displays the empirical distribution function of the ratio between the maximum size of the initial uniform 95% influence function-based confidence band and that of the isotonic band. For large samples, the maximal widths are often close, but for smaller samples, the initial confidence bands can be up to 50% larger than the isotonic bands, especially for the bivariate case. The empirical coverage of both bands is provided in Table 1. The coverage of the isotonic band is essentially the same as that of the initial band for the univariate case, whereas it is slightly larger than that of the initial band in the bivariate case.

Table 1.

Coverage of 95% confidence bands for the true counterfactual distribution function.

Dimension	Band	n = 100	n = 250	n = 500	n = 750	n = 1000
d = 1	Initial band	92.5	94.1	96.0	94.5	95.5
d = 1	Monotone band	92.5	94.1	96.0	94.5	95.5
d = 2	Initial band	93.9	94.0	95.0	94.6	94.9
d = 2	Monotone band	95.7	95.9	95.5	95.3	95.1


4.2. Example 2: Estimation of a conditional distribution function

We next demonstrate the use of Theorem 2 with dimension d = 2 for drawing inference on a conditional distribution function. Suppose that the data unit is the vector X = (A, Y), where Y is an outcome and A is now a continuous exposure. The observed data consist of independent draws (A₁, Y₁), (A₂, Y₂), … , (A_n, Y_n), from $P_{0} \in M$ , where $M$ is a nonparametric model. We define the parameter value θ_P pointwise as θ_P(t₁, t₂) := P (Y ≤ t₁ ∣ A = t₂). Thus, θ_P is the conditional distribution function of Y at t₁ given A = t₂. The map (t₁, t₂) ↦ θ_P(t₁, t₂) is necessarily monotone in t₁ for each fixed t₂, and in some settings, it may be known that it is also monotone in t₂ for each fixed t₁. This parameter completely describes the conditional distribution of Y given A, and can be used to obtain the conditional mean, conditional quantiles, or any other conditional parameter of interest.

For each t₁, the true function θ₀(t₁, t₂) = θ_P₀ (t₁, t₂) may be written as the conditional mean of I(Y ≤ t₁) given A = t₂. Hence, any method of nonparametric regression can be used to estimate t₂ ↦ θ₀(t₁, t₂) for fixed t₁, and repeating such a method over a grid of values of t₁ yields an estimator of the entire function. We expect that our results would apply to many of these methods. Here, we consider the local linear estimator (Fan and Gijbels, 1996), which may be expressed as

θ_{n} (t_{1}, t_{2}) ≔ \frac{1}{n h_{n}} \sum_{i = 1}^{n} I (Y_{i} \leq t_{1}) [\frac{s_{2, n} (t_{2}) - s_{1, n} (t_{2}) (A_{i} - t_{2})}{s_{0, n} (t_{2}) s_{2, n} (t_{2}) - s_{1, n} (t_{2})^{2}}] K (\frac{A_{i} - t_{2}}{h_{n}}),

where $K : R \to R$ is a symmetric and bounded kernel function, h_n → 0 is a sequence of bandwidths, and

s_{j, n} (t_{2}) ≔ \frac{1}{n h_{n}} \sum_{i = 1}^{n} (A_{i} - t_{2})^{j} K (\frac{A_{i} - t_{2}}{h_{n}})

for j ∈ {0, 1, 2}. Under regularity conditions on the true distribution function θ₀, the marginal density f₀ of A, the bandwidth sequence h_n, and the kernel function K, for any fixed (t₁, t₂), θ_n satisfies

(n h_{n})^{1 ∕ 2} [θ_{n} (t_{1}, t_{2}) - θ_{0} (t_{1}, t_{2}) - h_{n}^{2} V_{K} b_{0} (t_{1}, t_{2})] \overset{d}{\to} N (0, S_{K} v_{0} (t_{1}, t_{2})),

where V_K := ∫ x²K(x)dx is the variance of K, S_K := ∫ K(x)²dx, and b₀(t₁, t₂) and v₀(t₁, t₂) depend on the derivatives of θ₀ and on f₀. If h_n is chosen to be of order n^−1/5, the rate that minimizes the asymptotic mean integrated squared error of θ_n relative to θ₀, then n^2/5 [θ_n(t₁, t₂) − θ₀(t₁, t₂)] converges in law to a normal random variate with mean V_Kb₀(t₁, t₂) and variance S_Kv₀(t₁, t₂). Under stronger regularity conditions, the rate of convergence of the uniform norm $‖ θ_{n} - θ_{0} ‖_{T}$ can be shown to be (nh_n/log n)^1/2 (Hardle, Janssen and Serfling, 1988).
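A minimal sketch of the local linear estimator θ_n(t₁, t₂) as displayed above, written directly from the formulas for θ_n and s_j,n. The Epanechnikov kernel is an assumption chosen because it is symmetric, bounded and supported on [−1, 1]; the function names are illustrative, and this is not the KernSmooth implementation.

```python
def epanechnikov(u):
    """Symmetric, bounded kernel supported on [-1, 1] (assumed choice)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def local_linear_cdf(t1, t2, a_obs, y_obs, h, kernel=epanechnikov):
    """Local linear estimate of P(Y <= t1 | A = t2) with bandwidth h."""
    n = len(a_obs)
    k = [kernel((a - t2) / h) for a in a_obs]
    # s[j] = (n h)^{-1} * sum_i (A_i - t2)^j K((A_i - t2) / h), j = 0, 1, 2.
    s = [sum((a - t2) ** j * kj for a, kj in zip(a_obs, k)) / (n * h)
         for j in range(3)]
    denom = s[0] * s[2] - s[1] ** 2
    if denom <= 0.0:
        raise ValueError("too few distinct observations near t2 for this bandwidth")
    total = 0.0
    for a, y, kj in zip(a_obs, y_obs, k):
        if y <= t1:
            total += (s[2] - s[1] * (a - t2)) / denom * kj
    return total / (n * h)
```

The effective weights sum to one but can be negative, so the estimate may leave [0, 1] and, for fixed t₁, need not be monotone in t₂. For fixed t₂ the weights do not depend on t₁, so any negative weight can also break monotonicity in t₁.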

Theorem 3 cannot be used to establish (A) in this problem, since θ_n is not an asymptotically linear estimator. Furthermore, as discussed above, recent results suggest that { $r_{n} [θ_{n} (t) - θ_{0} (t)] : t \in T$ } does not converge weakly to a tight limit in $ℓ^{\infty} (T)$ for any useful rate r_n. Despite this lack of weak convergence, condition (A) can be verified directly in the context of this example under smoothness conditions on θ₀ and f₀ using the tail bounds for empirical processes outlined in Section 3.2. Denoting by $θ_{0, t_{2}}^{'}$ and $θ_{0, t_{2}}^{″}$ the first and second derivatives of θ₀ with respect to its second argument, we define

R_{θ}^{(2)} (t, δ) ≔ θ_{0} (t_{1}, t_{2} + δ) - θ_{0} (t_{1}, t_{2}) - δ θ_{0, t_{2}}^{'} (t_{1}, t_{2}) - \frac{1}{2} δ^{2} θ_{0, t_{2}}^{″} (t_{1}, t_{2})

and $R_{f}^{(1)} (t, δ) ≔ f_{0} (t_{2} + δ) - f_{0} (t_{2}) - δ f_{0}^{'} (t_{2})$ , where $f_{0}^{'}$ is the derivative of f₀. We then introduce the following conditions on θ₀, f₀, and K:

(d) $θ_{0, t_{2}}^{″}$ exists and is continuous on $T$ , and as δ → 0, ${sup}_{t \in T} ∣ R_{θ}^{(2)} (t, δ) ∣ = o (δ^{2})$ ;

(e) ${inf}_{t \in T} f_{0} (t) > 0$ , $f_{0}^{'}$ exists and is continuous on $T$ , and ${sup}_{t \in T} ∣ R_{f}^{(1)} (t, δ) ∣ = o (δ)$ ;

(f) K is a Lipschitz function supported on [−1, 1] satisfying condition (M) of Stupfler (2016).

We also define

ν_{n, t} (y, a) ≔ [I (y \leq t_{1}) - θ_{0} (t_{1}, a)] K (\frac{a - t_{2}}{h_{n}}),

g_{n} (t_{2}) ≔ s_{0, n} (t_{2}) s_{2, n} (t_{2}) - s_{1, n} (t_{2})^{2},

R_{n} (t) ≔ h_{n}^{- 1 ∕ 2} [\frac{s_{2, n} (t_{2})}{g_{n} (t_{2})} G_{n} ν_{n, t} - \frac{s_{1, n} (t_{2})}{g_{n} (t_{2})} G_{n} (ℓ_{t} ν_{n, t})] .

We then have the following result.

Proposition 1. If (d)–(f) hold, $n h_{n}^{4} ∕ \log h_{n}^{- 1} \to \infty$ and $n h_{n}^{5} = O (1)$ , then

sup_{t \in T} ∣ (n h_{n})^{1 ∕ 2} [θ_{n} (t_{1}, t_{2}) - θ_{0} (t_{1}, t_{2})] - (n h_{n}^{5})^{1 ∕ 2} \frac{1}{2} θ_{0, t_{2}}^{″} (t_{1}, t_{2}) K_{2} - R_{n} (t) ∣ \overset{P}{\to} 0 .

Proposition 1 aids in establishing the following result, which formally establishes asymptotic equivalence of the local linear estimator of a conditional distribution function and its correction obtained via isotonic regression at the rate r_n = (nh_n)^1/2.

Proposition 2. If (d)–(f) hold and $n h_{n}^{5} \to c \in (0, \infty)$ , then (A) holds for the local linear estimator with r_n = (nh_n)^1/2.

The proofs of Propositions 1 and 2 are provided in Appendix A.5. These results may also be of interest in their own right for establishing other properties of the local linear estimator.

As with the first example, we conducted a simulation study to validate our theoretical results. For sample sizes n ∈ {100, 250, 500, 750, 1000}, we generated 1000 random datasets as follows. We first simulated A as a Beta(2, 3) variate. Given A = a, Y was simulated as the inverse-logistic transformation of a normal variate with mean 0.5 × [1 + (a − 1.2)²] and variance one.
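Under this design the true conditional distribution function has a closed form, sketched below. The sketch assumes that "inverse-logistic transformation" means Y = expit(Z) for the latent normal variate Z, so that P(Y ≤ t₁ ∣ A = t₂) = Φ(logit(t₁) − 0.5[1 + (t₂ − 1.2)²]); the function names are illustrative.

```python
import math


def std_normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def true_conditional_cdf(t1, t2):
    """theta_0(t1, t2) = P(Y <= t1 | A = t2) for 0 < t1 < 1,
    assuming Y = expit(Z) with Z ~ N(0.5 * [1 + (t2 - 1.2)^2], 1)."""
    logit_t1 = math.log(t1 / (1.0 - t1))
    mean = 0.5 * (1.0 + (t2 - 1.2) ** 2)
    return std_normal_cdf(logit_t1 - mean)
```

Because the latent mean is decreasing in t₂ on the support [0, 1] of A, this function is increasing in t₂ as well as in t₁, which is the component-wise monotonicity exploited by the bivariate correction.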

For each simulated dataset, we estimated θ₀(y, a) for each (y, a) in an equally spaced square grid of mesh ω_n = n^−4/5. For each unique y in this grid, we estimated the function a ↦ θ₀(y, a) using the local linear estimator, as implemented in the R package KernSmooth (Wand, 2015; Wand and Jones, 1995). For each value of y in the grid, we computed the optimal bandwidth based on the direct plug-in methodology of Ruppert, Sheather and Wand (1995) as implemented by the dpill function, and we then set our bandwidth as the average of these y-specific bandwidths. We constructed initial confidence bands using a variable-width nonparametric bootstrap (Hall and Kang, 2001).

We first note that, for all sample sizes considered, over 99% of simulations had monotonicity violations in both the y- and a-directions. Figure 2 displays the results of this simulation study. The left exhibit of Figure 2 confirms that the discrepancy between θ_n and $θ_{n}^{*}$ decreases faster than $r_{n}^{- 1} = n^{- 2 ∕ 5}$ , as our theory suggests. The middle exhibit indicates that in roughly 50% of simulations, there is less than 5% difference between $‖ θ_{n}^{*} - θ_{0} ‖_{T_{n}}$ and $‖ θ_{n} - θ_{0} ‖_{T_{n}}$ , but even for n = 1000, in roughly 25% of simulations, $θ_{n}^{*}$ offers at least a 25% improvement in estimation error. In smaller samples, the estimation error of $θ_{n}^{*}$ is less than half that of θ_n in 5–10% of simulations. The rightmost exhibit indicates that the projected confidence bands regularly reduce the uniform size of the initial bands by 10–20%. Finally, the empirical coverage of uniform 95% bootstrap-based bands and their projected versions is provided in Table 2. As before, the projected band is at least as conservative as the initial band, and the difference in coverage diminishes as n grows. However, the initial bands in this example are anti-conservative, even at n = 1000, likely due to the slower rate of convergence, and the corrected bands offer a much more substantial improvement in this example than in the first.

Fig 2. — Summary of simulation results for conditional distribution function. The three columns display the same results as those in Figure 1.

Table 2.

Coverage of 95% confidence bands for the true conditional distribution function.

Band	n = 100	n = 250	n = 500	n = 750	n = 1000
Initial band	37.6	64.9	83.2	86.3	89.7
Monotone band	60.8	80.4	90.3	92.3	93.9


5. Discussion

Many estimators of function-valued parameters in nonparametric and semiparametric models are not guaranteed to respect shape constraints on the true function. A simple and general solution to this problem is to project the initial estimator onto the constrained parameter space over a grid whose mesh goes to zero fast enough with sample size. However, this introduces the possibility that the projected estimator has different properties than the original estimator. In this paper, we studied the important shape constraint of multivariate component-wise monotonicity. We provided results indicating that the projected estimator is generically no worse than the initial estimator, and that if the true function is strictly increasing and the initial estimator possesses a relatively weak type of stochastic equicontinuity, the projected estimator is uniformly asymptotically equivalent to the initial estimator. We provided especially simple sufficient conditions for this latter result when the initial estimator is uniformly asymptotically linear, and provided guidance on establishing the key condition for kernel smoothed estimators.

We studied the application of our results in two examples: estimation of a G-computed distribution function, for use in understanding the effect of a binary exposure on an outcome when the exposure-outcome relationship is confounded by recorded covariates, and of a conditional distribution function, for use in characterizing the marginal dependence of an outcome on a continuous exposure. In numerical studies, we found that the projected estimator yielded improvements over the initial estimator. The improvements were especially strong in the latter example.

In our examples, we only studied corrections in dimensions d = 1 and d = 2. In future work, it would be interesting to consider corrections in dimensions higher than 2. For example, for the conditional distribution function, it would be of interest to study multivariate local linear estimators for a continuous exposure A taking values in $R^{d - 1}$ for d > 2. Since tailored algorithms for computing the isotonic regression do not yet exist for d > 2, it would also be of interest to determine whether a version of Theorem 2 could be established for the relaxed isotonic estimator proposed by Fokianos, Leucht and Neumann (2017). Alternatively, it is possible that the uniform stochastic equicontinuity currently required by Chernozhukov, Fernández-Val and Galichon (2010) and Daouia and Park (2013) for asymptotic equivalence of the rearrangement- and envelope-based corrections, respectively, could be relaxed along the lines of our condition (A). Finally, our theoretical results do not give the exact asymptotic behavior of the projected estimator or projected confidence band when the true function possesses flat regions. This is also an interesting topic for future research.

Acknowledgements

The authors gratefully acknowledge the constructive comments of the editors and anonymous reviewers as well as grant support from the National Institute of Allergy and Infectious Diseases (NIAID) and the National Heart, Lung and Blood Institute (NHLBI) of the National Institutes of Health.

Supported by NIAID grant UM1AI068635.

Supported by NIAID grant R01AI074345.

Supported by NHLBI grant R01HL137808.

Appendix A: Technical proofs

A.1. Proof of Theorem 1

Part (i) follows from Corollary B to Theorem 1.6.1 of Robertson, Wright and Dykstra (1988). For parts (ii) and (iii), we note that by assumption

∣ θ_{n}^{*} (t) - θ_{0} (t) ∣ \leq \sum_{k} λ_{k, n} (t) ∣ θ_{n}^{*} (s_{k}) - θ_{0} (s_{k}) ∣ + \sum_{k} λ_{k, n} (t) ∣ θ_{0} (s_{k}) - θ_{0} (t) ∣

for every $t \in T$ , where Σ_k λ_k,n(t) = 1, and for each k, $s_{k} \in T_{n}$ and ∥s_k − t∥ ≤ 2ω_n. By part (i), the first term is bounded above by ${sup}_{s \in T_{n}} ∣ θ_{n} (s) - θ_{0} (s) ∣$ . The second term is bounded above by γ(2ω_n), where we define

γ (δ) ≔ sup {∣ θ_{0} (t) - θ_{0} (s) ∣ : t, s \in T, ‖ t - s ‖ \leq δ} .

If θ₀ is continuous on $T$ , then it is also uniformly continuous since $T$ is compact. Therefore, γ(δ) → γ(0) = 0 as δ → 0, so that γ(2ω_n) →_P 0 if ω_n →_P 0. If γ(δ) = o(δ^α) as δ → 0, then $γ (2 ω_{n}) = o_{P} (ω_{n}^{α})$ .

Part (iv) follows from the proof of Proposition 3 of Chernozhukov, Fernández-Val and Galichon (2009), which applies to any order-preserving monotonization procedure. For the first statement of (v), by their definition as minimizers of the least-squares criterion function, we note that $\sum_{t \in T_{n}} u_{n}^{*} (t) = \sum_{t \in T_{n}} u_{n} (t)$ , and similarly for $ℓ_{n}^{*}$ . The second statement of (v) follows from a slight modification of Theorem 1.6.1 of Robertson, Wright and Dykstra (1988). As stated, the result says that $\sum_{t \in T_{n}} G (θ^{*} (t) - ψ (t)) \leq \sum_{t \in T_{n}} G (θ (t) - ψ (t))$ for any convex function $G : R \to R$ and monotone function ψ, where θ* is the isotonic regression of θ over $T_{n}$ . A straightforward adaptation of the proof indicates that $\sum_{t \in T_{n}} G (θ_{1}^{*} (t) - θ_{2}^{*} (t)) \leq \sum_{t \in T_{n}} G (θ_{1} (t) - θ_{2} (t))$ , where now $θ_{1}^{*}$ and $θ_{2}^{*}$ are the isotonic regressions of θ₁ and θ₂ over $T_{n}$ , respectively. As in Corollary B, taking G(x) = ∣x∣^p and letting p → ∞ yields that $‖ θ_{1}^{*} - θ_{2}^{*} ‖_{T_{n}} \leq ‖ θ_{1} - θ_{2} ‖_{T_{n}}$ . Applying this with θ₁ = u_n and θ₂ = ℓ_n establishes the second portion of (v). □

A.2. Proof of Theorem 2

We prove Theorem 2 via three lemmas, which may be of interest in their own right. The first lemma controls the size of deviations in θ_n over small neighborhoods, and does not hinge on condition (C) holding.

Lemma 1. If (A)–(B) hold and $b_{n} = o_{P} (r_{n}^{- 1})$ , then

sup_{‖ t - s ‖ \leq b_{n}} ∣ θ_{n} (t) - θ_{n} (s) ∣ = o_{P} (r_{n}^{- 1}) .

Proof of Lemma 1. In view of the triangle inequality,

∣ θ_{n} (t) - θ_{n} (s) ∣ \leq ∣ {θ_{n} (t) - θ_{0} (t)} - {θ_{n} (s) - θ_{0} (s)} ∣ + ∣ θ_{0} (t) - θ_{0} (s) ∣ .

The first term is $o_{P} (r_{n}^{- 1})$ by (A), whereas the second term is $o_{P} (r_{n}^{- 1})$ by (B). □

The second lemma controls the size of neighborhoods over which violations in monotonicity can occur. Henceforth, we define

κ_{n} ≔ sup {‖ t - s ‖ : s, t \in T, s \leq t, θ_{n} (t) \leq θ_{n} (s)} .

In this lemma we again require (A) but now require (C) rather than (B).

Lemma 2. If (A) and (C) hold, then $κ_{n} = o_{P} (r_{n}^{- 1})$ .

Proof of Lemma 2. Let ϵ > 0 and η_n := ϵ/r_n. Suppose that κ_n > η_n. Then, there exist s, $t \in T$ with s < t and ∥t − s∥ > η_n such that θ_n(s) ≥ θ_n(t). We claim that there must also exist s*, $t^{*} \in T$ with s* < t* and ∥t* − s*∥ ∈ [η_n/2, η_n] such that θ_n(s*) ≥ θ_n(t*). To see this, let J = ⌊∥t − s∥/(η_n/2)⌋ − 1, and note that J ≥ 1. Define t_j := s+(jη_n/2)(t − s)/∥t − s∥ for j = 0, 1, … , J, and set t_J+1 := t. Thus, t_j < t_j+1 and ∥t_j+1 − t_j∥ ∈ [η_n/2, η_n] for each j = 0, 1, … , J. Since then $\sum_{j = 0}^{J} [θ_{n} (t_{j + 1}) - θ_{n} (t_{j})] = θ_{n} (t) - θ_{n} (s) \leq 0$ , it must be that θ_n(t_j+1) ≤ θ_n (t_j) for at least one j. This proves the claim.
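The chaining construction in this proof can be sketched as follows, assuming ∥·∥ is the Euclidean norm (the norm is not specified in this passage). The helper names and the toy function `theta` are hypothetical. The first helper builds t₀ = s, …, t_J+1 = t and the second locates a consecutive pair at which monotonicity fails.

```python
import math


def chain_points(s, t, eta):
    """Points t_0 = s, ..., t_{J+1} = t on the segment from s to t whose
    consecutive distances lie in [eta/2, eta]; requires ||t - s|| > eta."""
    dist = math.sqrt(sum((b - a) ** 2 for a, b in zip(s, t)))
    if dist <= eta:
        raise ValueError("construction requires ||t - s|| > eta")
    big_j = math.floor(dist / (eta / 2.0)) - 1
    direction = [(b - a) / dist for a, b in zip(s, t)]
    points = [tuple(a + j * (eta / 2.0) * u for a, u in zip(s, direction))
              for j in range(big_j + 1)]
    points.append(tuple(t))
    return points


def first_violation(theta, points):
    """First consecutive pair (t_j, t_{j+1}) with theta(t_{j+1}) <= theta(t_j),
    or None if theta strictly increases along the chain."""
    for p, q in zip(points, points[1:]):
        if theta(q) <= theta(p):
            return p, q
    return None


# Toy example: s < t component-wise, ||t - s|| = 1 > eta, theta(t) <= theta(s).
s, t, eta = (0.0, 0.0), (0.6, 0.8), 0.3
theta = lambda p: (p[0] - 0.4) ** 2
pts = chain_points(s, t, eta)
pair = first_violation(theta, pts)
```

Since the increments of theta along the chain sum to theta(t) − theta(s) ≤ 0, at least one increment must be non-positive, so `first_violation` cannot return `None` in this situation.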

We now have that κ_n > η_n implies that there exist s, $t \in T$ with s < t and ∥t − s∥ ∈ [η_n/2, η_n] such that θ_n(s) ≥ θ_n(t). This further implies that

{θ_{n} (t) - θ_{0} (t)} - {θ_{n} (s) - θ_{0} (s)} \leq - {θ_{0} (t) - θ_{0} (s)} \leq - K_{0} ‖ t - s ‖ \leq - K_{0} η_{n} ∕ 2

by condition (C). Finally, this allows us to write

P_{0} (κ_{n} > \frac{ϵ}{r_{n}}) \leq P_{0} {sup_{‖ t - s ‖ \leq ϵ ∕ r_{n}} r_{n} ∣ [θ_{n} (t) - θ_{0} (t)] - [θ_{n} (s) - θ_{0} (s)] ∣ \geq \frac{K_{0} ϵ}{2}} .

By condition (A), this probability tends to zero for every ϵ > 0, which completes the proof. □

Our final lemma bounds the maximal absolute deviation between $θ_{n}^{*}$ and θ_n over the grid $T_{n}$ in terms of the supremal deviations of θ_n over neighborhoods smaller than κ_n. This lemma does not depend on any of the conditions (A)–(C).

Lemma 3. It holds that $\max_{t \in T_{n}} ∣ θ_{n}^{*} (t) - θ_{n} (t) ∣ \leq {sup}_{‖ s - t ‖ \leq κ_{n}} ∣ θ_{n} (s) - θ_{n} (t) ∣$ .

Proof of Lemma 3. By Theorem 1.4.4 of Robertson, Wright and Dykstra (1988), for any $t \in T_{n}$ ,

θ_{n}^{*} (t) = \max_{U \in U_{t}} min_{L \in L_{t}} θ_{n} (U \cap L) = min_{L \in L_{t}} \max_{U \in U_{t}} θ_{n} (U \cap L),

where, for any finite set $S \subseteq T_{n}$ , θ_n(S) is defined as ∣S∣⁻¹ Σ_s∈S θ_n(s). The sets U range over the collection $U_{t}$ of upper sets of $T_{n}$ containing t, where $U \subseteq T_{n}$ is called an upper set if t₁ ∈ U, $t_{2} \in T_{n}$ and t₁ ≤ t₂ implies t₂ ∈ U. The sets L range over the collection $L_{t}$ of lower sets of $T_{n}$ containing t, where $L \subseteq T_{n}$ is called a lower set if t₁ ∈ L, $t_{2} \in T_{n}$ and t₂ ≤ t₁ implies t₂ ∈ L.
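In dimension d = 1 the max-min formula above can be evaluated by brute force, which may help make the definitions concrete. With the grid ordered increasingly, the upper sets containing the i-th point are the tails starting at some j ≤ i and the lower sets are the heads ending at some k ≥ i, so each intersection U ∩ L is a run from j to k. The sketch below is an illustration for small grids only (it is cubic in the grid size), and the function name is hypothetical.

```python
def isotonic_maxmin(values):
    """Univariate isotonic regression via the max-min formula:
    fitted[i] = max over j <= i of min over k >= i of mean(values[j..k])."""
    m = len(values)
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)

    def run_mean(j, k):
        # Average of values[j], ..., values[k] (inclusive).
        return (prefix[k + 1] - prefix[j]) / (k - j + 1)

    return [max(min(run_mean(j, k) for k in range(i, m)) for j in range(i + 1))
            for i in range(m)]


fitted = isotonic_maxmin([0.1, 0.3, 0.2, 0.5, 0.4])
```

For this toy input the two adjacent violating pairs are each replaced by their averages, giving fitted values of approximately 0.1, 0.25, 0.25, 0.45 and 0.45.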

Let U_t := {s : s ≥ t} and L_t := {s : s ≤ t}. First, suppose there exists $L_{0} \in L_{t}$ and s₀ ∈ L₀ with s₀ > t and ∥t − s₀∥ > κ_n. Then, we claim that there exists another lower set $L_{0}^{'} \in L_{t}$ such that θ_n(U_t ∩ L₀) > θ_n $(U_{t} \cap L_{0}^{'})$ . If θ_n(U_t ∩ L₀) > θ_n(t) = θ_n(U_t ∩ L_t), then $L_{0}^{'} = L_{t}$ satisfies the claim. Otherwise, if θ_n(U_t ∩ L₀) ≤ θ_n(t), let

L_{0}^{'} ≔ L_{0} ∖ {s : s > t, ‖ t - s ‖ > κ_{n}} .

One can verify that $L_{0}^{'} \in L_{t}$ , and since s₀ ∈ L₀ \ $L_{0}^{'}$ , $L_{0}^{'}$ is a strict subset of L₀. Furthermore, by definition of κ_n, θ_n(s) > θ_n(t) for all s > t such that ∥t − s∥ > κ_n, and since θ_n(U_t ∩ L₀) ≤ θ_n(t), removing these elements from L₀ can only reduce the average, so that $θ_{n} (U_{t} \cap L_{0}^{'})$ < θ_n(U_t ∩ L₀). This establishes the claim. By an analogous argument, we can show that if there exists $U_{0} \in U_{t}$ and s₀ ∈ U₀ with s₀ < t and ∥t − s₀∥ > κ_n, then there exists another upper set $U_{0}^{'} \in U_{t}$ such that θ_n(U₀ ∩ L_t) < $θ_{n} (U_{0}^{'} \cap L_{t})$ .

Let $L^{*} \in {argmin}_{L \in L_{t}} θ_{n} (U_{t} \cap L)$ and $U^{*} \in {argmax}_{U \in U_{t}} θ_{n} (U \cap L_{t})$ . Then,

θ_{n}^{*} (t) = \max_{U \in U_{t}} min_{L \in L_{t}} θ_{n} (U \cap L) \geq min_{L \in L_{t}} θ_{n} (U_{t} \cap L) = θ_{n} (U_{t} \cap L^{*}) and θ_{n}^{*} (t) = min_{L \in L_{t}} \max_{U \in U_{t}} θ_{n} (U \cap L) \leq \max_{U \in U_{t}} θ_{n} (U \cap L_{t}) = θ_{n} (U^{*} \cap L_{t}) .

Hence, $θ_{n} (U_{t} \cap L^{*}) \leq θ_{n}^{*} (t) \leq θ_{n} (U^{*} \cap L_{t})$ . By the above argument, we have that both

θ_{n} (U_{t} \cap L^{*}) \geq inf {θ_{n} (s) : s \geq t, ‖ t - s ‖ \leq κ_{n}} and θ_{n} (U^{*} \cap L_{t}) \leq sup {θ_{n} (s) : s \leq t, ‖ t - s ‖ \leq κ_{n}} .

Therefore, we find that

inf {θ_{n} (s) - θ_{n} (t) : ‖ t - s ‖ \leq κ_{n}} \leq θ_{n}^{*} (t) - θ_{n} (t) \leq sup {θ_{n} (s) - θ_{n} (t) : ‖ t - s ‖ \leq κ_{n}},

and thus, $∣ θ_{n}^{*} (t) - θ_{n} (t) ∣ \leq sup {∣ θ_{n} (s) - θ_{n} (t) ∣ : ‖ t - s ‖ \leq κ_{n}}$ . Taking the maximum over $t \in T_{n}$ yields the claim. □
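A small numerical illustration of Lemma 3 in dimension d = 1. As a simplifying assumption, κ_n is computed here over grid points only, whereas the definition above takes the supremum over all of $T$; the argument in the proof only compares grid points, so the same bound applies to this grid version. The helper names and the toy values are hypothetical.

```python
def kappa_on_grid(grid, values):
    """Largest distance t - s over grid points s <= t with
    values(t) <= values(s); grid must be sorted increasingly."""
    m = len(grid)
    return max(grid[j] - grid[i]
               for i in range(m) for j in range(i, m)
               if values[j] <= values[i])


def local_oscillation(grid, values, radius):
    """Largest |values(s) - values(t)| over grid points within `radius`."""
    m = len(grid)
    return max(abs(values[j] - values[i])
               for i in range(m) for j in range(m)
               if abs(grid[j] - grid[i]) <= radius)


grid = [0.0, 0.25, 0.5, 0.75, 1.0]
values = [0.1, 0.3, 0.2, 0.5, 0.4]          # toy initial estimate
fitted = [0.1, 0.25, 0.25, 0.45, 0.45]      # its isotonic regression (pooled by hand)

kappa = kappa_on_grid(grid, values)
lhs = max(abs(f - v) for f, v in zip(fitted, values))
rhs = local_oscillation(grid, values, kappa)
```

Here the monotonicity violations occur only between neighboring grid points, so κ equals the mesh 0.25, and the maximal correction (about 0.05) is well below the local oscillation of the initial estimate (about 0.3).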

The proof of Theorem 2 follows easily from Lemmas 1, 2 and 3.

Proof of Theorem 2. By construction, for each $t \in T$ , we can write

∣ θ_{n}^{*} (t) - θ_{n} (t) ∣ \leq Σ_{j = 1}^{2^{d}} λ_{j, n} (t) ∣ θ_{n}^{*} (s_{j}) - θ_{n} (s_{j}) ∣ + Σ_{j = 1}^{2^{d}} λ_{j, n} (t) ∣ θ_{n} (s_{j}) - θ_{n} (t) ∣,

where $s_{j} \in T_{n}$ and ∥s_j − t∥ ≤ 2ω_n for all t, s_j by definition. Thus, since Σ_j λ_j,n(t) = 1, it follows that

sup_{t \in T} ∣ θ_{n}^{*} (t) - θ_{n} (t) ∣ \leq \max_{t \in T_{n}} ∣ θ_{n}^{*} (t) - θ_{n} (t) ∣ + sup_{‖ s - t ‖ \leq 2 ω_{n}} ∣ θ_{n} (s) - θ_{n} (t) ∣ .

By Lemma 3, the first summand is bounded above by sup_{∥s−t∥≤κ_n} ∣θ_n(s) − θ_n(t)∣, which is $o_{P} (r_{n}^{- 1})$ by Lemmas 1 and 2. The second summand is $o_{P} (r_{n}^{- 1})$ by Lemma 1. □
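The first display of this proof can be unpacked as follows, under the assumption (suggested by the phrase "by construction" but not restated in this section) that θ*_n is extended from the grid to all of T by interpolation, with nonnegative weights λ_{j,n}(t) summing to one over the 2^d grid points s_1, ..., s_{2^d} surrounding t:

```latex
\begin{align*}
\theta_n^*(t) - \theta_n(t)
  &= \sum_{j=1}^{2^d} \lambda_{j,n}(t)\,\theta_n^*(s_j) - \theta_n(t) \\
  &= \sum_{j=1}^{2^d} \lambda_{j,n}(t)\,\bigl\{\theta_n^*(s_j) - \theta_n(s_j)\bigr\}
   + \sum_{j=1}^{2^d} \lambda_{j,n}(t)\,\bigl\{\theta_n(s_j) - \theta_n(t)\bigr\}.
\end{align*}
```

The second equality uses Σ_j λ_{j,n}(t) = 1, and the triangle inequality together with λ_{j,n}(t) ≥ 0 then gives the stated bound.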

A.3. Proof of Corollary 1

We note that ℓ_n(t) ≤ θ₀(t) ≤ u_n (t) if and only if

{r_{n} [θ_{n} (t) - ℓ_{n} (t)] - γ_{α} (t)} + γ_{α} (t) \geq r_{n} [θ_{n} (t) - θ_{0} (t)] \geq - γ_{α} (t) - {r_{n} [u_{n} (t) - θ_{n} (t)] - γ_{α} (t)} .

Therefore, by conditions (a)–(c), $P_{0} [ℓ_{n} (t) \leq θ_{0} (t) \leq u_{n} (t) for all t \in T] \to 1 - α$ . Next, we let δ > 0 and note that

sup_{‖ t - s ‖ \leq δ ∕ r_{n}} ∣ r_{n} {ℓ_{n} (t) - θ_{0} (t)} - r_{n} {ℓ_{n} (s) - θ_{0} (s)} ∣ \leq sup_{‖ t - s ‖ \leq δ ∕ r_{n}} ∣ r_{n} {θ_{n} (t) - θ_{0} (t)} - r_{n} {θ_{n} (s) - θ_{0} (s)} ∣ + sup_{‖ t - s ‖ \leq δ ∕ r_{n}} ∣ γ_{α} (t) - γ_{α} (s) ∣ + 2 ‖ r_{n} (θ_{n} - ℓ_{n}) - γ_{α} ‖_{T} .

The first term tends to zero in probability by (A), the second by the assumed uniform continuity of γ_α, and the third by conditions (a)–(c). An analogous decomposition holds for u_n. Therefore, we can apply Theorem 2 with u_n and ℓ_n in place of θ_n to find that $‖ ℓ_{n}^{*} - ℓ_{n} ‖_{T} = o_{P} (r_{n}^{- 1})$ and $‖ u_{n}^{*} - u_{n} ‖_{T} = o_{P} (r_{n}^{- 1})$ . Finally, applying an analogous argument to the event $ℓ_{n}^{*}$ ≤ θ₀ ≤ $u_{n}^{*}$ as we applied to ℓ_n ≤ θ₀ ≤ u_n above yields the result. □
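To illustrate the band correction on a finite grid, the sketch below applies a one-dimensional isotonic regression separately to a lower and an upper band, and checks that a monotone truth covered by the initial band remains covered and that the maximal band width does not grow. The simulated bands, the grid and the helper name `pava` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pava(y):
    """Least-squares nondecreasing fit by pooling adjacent violators."""
    blocks = []  # each block stores [sum of values, number of values]
    for v in y:
        blocks.append([float(v), 1])
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return np.concatenate([np.full(c, s / c) for s, c in blocks])

rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 50)
theta_0 = grid ** 2                                   # monotone truth (assumed)
lower = theta_0 - rng.uniform(0.0, 0.1, grid.size)    # non-monotone lower band
upper = theta_0 + rng.uniform(0.0, 0.1, grid.size)    # non-monotone upper band

lower_star = pava(lower)   # corrected lower band
upper_star = pava(upper)   # corrected upper band

covered = bool(np.all(lower_star <= theta_0 + 1e-12)
               and np.all(theta_0 <= upper_star + 1e-12))
width_before = float(np.max(upper - lower))
width_after = float(np.max(upper_star - lower_star))
print(covered, width_before, width_after)
```

Both properties follow from the fact that isotonic regression is order-preserving and leaves a monotone function unchanged, so they hold for any initial band containing a monotone truth on the grid, not only for this simulated one.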

A.4. Proof of Theorem 3

Let ϵ, δ, η > 0. By (3.1) and since ${sup}_{t \in T} ∣ R_{n, t} ∣ = o_{P} (n^{- 1 ∕ 2})$ ,

n^{1 ∕ 2} ∣ {θ_{n} (t) - θ_{0} (t)} - {θ_{n} (s) - θ_{0} (s)} ∣ \leq ∣ G_{n} (ϕ_{0, t} - ϕ_{0, s}) ∣ + o_{P} (1) .

Condition (A2) implies that { $ϕ_{0, t} : t \in T$ } is uniformly mean-square continuous, in the sense that

lim_{h \to 0} sup_{‖ t - s ‖ \leq h} \int {ϕ_{0, s} (x) - ϕ_{0, t} (x)}^{2} d P_{0} (x) = 0 .

Since $T$ is totally bounded in ∥ · ∥, this also implies that { $ϕ_{0, t} : t \in T$ } is totally bounded in the L₂(P₀) metric. This, in addition to (A1), implies that { $G_{n} ϕ_{0, t} : t \in T$ } converges weakly in $ℓ^{\infty} (T)$ to a Gaussian process $G$ with covariance function Σ₀. Furthermore, (A2) implies that this limit process is a tight element of $ℓ^{\infty} (T)$ . By Theorem 1.5.4 of van der Vaart and Wellner (1996), { $G_{n} ϕ_{0, t} : t \in T$ } is asymptotically tight. By Theorem 1.5.7 of van der Vaart and Wellner (1996), { $G_{n} ϕ_{0, t} : t \in T$ } is thus asymptotically uniformly mean-square equicontinuous in probability, in the sense that there exists some δ₀ = δ₀(ϵ, η) > 0 such that

\underset{n \to \infty}{lim sup} P_{0} [sup_{ρ (s, t) < δ_{0}} ∣ G_{n} (ϕ_{0, t} - ϕ_{0, s}) ∣ > ϵ] < η

with ρ(s, t) := [ʃ{ϕ_0,t(x) − ϕ_0,s(x)}²dP₀(x)]^1/2. By (A2), sup_{∥t−s∥≤h} ρ(t, s) < δ₀ for some h > 0. Hence, for all n large, both δn^−1/2 ≤ h and

P_{0} [sup_{ρ (s, t) < δ_{0}} ∣ G_{n} (ϕ_{0, t} - ϕ_{0, s}) ∣ > ϵ] < η,

so that

P_{0} [sup_{‖ t - s ‖ \leq δ ∕ n^{1 ∕ 2}} ∣ G_{n} (ϕ_{0, t} - ϕ_{0, s}) ∣ > ϵ] \leq P_{0} [sup_{ρ (s, t) < δ_{0}} ∣ G_{n} (ϕ_{0, t} - ϕ_{0, s}) ∣ > ϵ] < η

and the proof is complete. □
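The conclusion of this proof can be visualized in the simplest asymptotically linear example, chosen here purely for illustration: the empirical distribution function of Uniform(0, 1) data, for which ϕ_{0,t}(x) = I(x ≤ t) − t and G_n ϕ_{0,t} = n^{1/2}{F_n(t) − t}. The sketch below approximates the supremum of ∣G_n(ϕ_{0,t} − ϕ_{0,s})∣ over pairs with ∣t − s∣ ≤ δ n^{−1/2} on a finite grid; the grid size and the function name are assumptions of the sketch.

```python
import numpy as np

def local_oscillation(x, delta, grid_size=2000):
    """Grid approximation of sup |G_n(phi_t - phi_s)| over |t - s| <= delta / sqrt(n),
    where G_n phi_t = sqrt(n) * (F_n(t) - t) for Uniform(0, 1) data x."""
    n = len(x)
    grid = np.linspace(0.0, 1.0, grid_size + 1)
    ecdf = np.searchsorted(np.sort(x), grid, side="right") / n
    g = np.sqrt(n) * (ecdf - grid)
    # Number of grid steps that fit inside a window of width delta / sqrt(n).
    max_lag = int(np.floor(delta / np.sqrt(n) * grid_size + 1e-9))
    out = 0.0
    for lag in range(1, max_lag + 1):
        out = max(out, float(np.max(np.abs(g[lag:] - g[:-lag]))))
    return out

rng = np.random.default_rng(2)
for n in (100, 1_000, 10_000):
    print(n, local_oscillation(rng.uniform(size=n), delta=1.0))
```

Heuristically, one would expect the printed values to decrease as n grows, since the number of observations falling in a window of width δ n^{−1/2} fluctuates on a scale that is small relative to n^{1/2}; the simulation is only suggestive and does not replace the argument above.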

A.5. Proof of Propositions 1 and 2

Below, we refer to van der Vaart and Wellner (1996) as VW. Throughout, the symbol ≲ should be interpreted to mean ‘bounded above, up to a multiplicative constant not depending on n, t, y or a.’

We first note that condition (M) of Stupfler (2016) guarantees that the class

{x \mapsto K (\frac{x - t}{h}) : h > 0, t \in R}

is Vapnik–Chervonenkis (henceforth VC) with index 2. In addition, we define K_j := ʃ u^jK(u) du and

w_{n} (a, t_{2}) ≔ s_{2, n} (t_{2}) - s_{1, n} (t_{2}) (a - t_{2}), w_{0} (a, t_{2}) ≔ f_{0} (t_{2}) - f_{0}^{'} (t_{2}) (a - t_{2}) .
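For orientation, the quantities s_{j,n}, g_n and w_n are the building blocks of the local linear weights. The sketch below computes them under assumptions consistent with the displays used later in this appendix but not restated here: s_{j,n}(t₂) = (n h_n)^{−1} Σ_i (A_i − t₂)^j K((A_i − t₂)/h_n) and g_n = s_{0,n} s_{2,n} − s_{1,n}². The Epanechnikov kernel, the simulated data and the function names are illustrative choices.

```python
import numpy as np

def epanechnikov(u):
    """Symmetric, bounded, Lipschitz kernel supported on [-1, 1] (assumed choice)."""
    return 0.75 * np.clip(1.0 - u ** 2, 0.0, None)

def local_linear_weights(a, t2, h):
    """Weights (n h)^{-1} K((a_i - t2) / h) w_n(a_i, t2) / g_n(t2), one per observation."""
    d = a - t2
    k = epanechnikov(d / h)
    s0, s1, s2 = (np.mean(d ** j * k) / h for j in range(3))
    g = s0 * s2 - s1 ** 2      # g_n(t2), assumed form
    w = s2 - s1 * d            # w_n(a_i, t2)
    return k * w / (g * h * len(a))

def local_linear_cdf(y, a, t1, t2, h):
    """Local linear estimate of P(Y <= t1 | A = t2)."""
    return float(np.sum(local_linear_weights(a, t2, h) * (y <= t1)))

rng = np.random.default_rng(3)
a = rng.uniform(size=500)
y = a + 0.2 * rng.standard_normal(500)
print(local_linear_cdf(y, a, t1=0.5, t2=0.5, h=0.2))
```

Under these assumptions the weights sum to one and annihilate the linear term a − t₂, which is why the estimation error can be written as a weighted average of I(y ≤ t₁) − θ₀(t₁, t₂), as is done in the proof of Proposition 1.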

Before proving Propositions 1 and 2, we state and prove a lemma we will use.

Lemma 4. If (d)–(f) hold, $n h_{n}^{4} \to \infty$ and $n h_{n}^{5} = O (1)$ , then

(n h_{n}^{5})^{1 ∕ 2} sup_{t \in T} ∣ \frac{s_{1, n} (t_{2})}{g_{n} (t_{2})} - \frac{f_{0}^{'} (t_{2})}{f_{0} (t_{2})^{2}} ∣ \overset{P}{\to} 0, (n h_{n}^{5})^{1 ∕ 2} sup_{t \in T} ∣ \frac{s_{2, n} (t_{2})}{g_{n} (t_{2})} - \frac{1}{f_{0} (t_{2})} ∣ \overset{P}{\to} 0, (n h_{n}^{5})^{1 ∕ 2} sup_{t \in T} sup_{∣ a - t_{2} ∣ \leq h_{n}} ∣ \frac{w_{n} (a, t_{2})}{g_{n} (t_{2})} - \frac{w_{0} (a, t_{2})}{f_{0} (t_{2})^{2}} ∣ \overset{P}{\to} 0,

and for any δ > 0,

(n h_{n}^{5})^{1 ∕ 2} sup_{‖ t - s ‖ \leq δ ∕ (n h_{n})^{1 ∕ 2}} ∣ \frac{s_{1, n} (t_{2})}{g_{n} (t_{2})} - \frac{s_{1, n} (s_{2})}{g_{n} (s_{2})} ∣ \overset{P}{\to} 0, (n h_{n}^{5})^{1 ∕ 2} sup_{‖ t - s ‖ \leq δ ∕ (n h_{n})^{1 ∕ 2}} ∣ \frac{s_{2, n} (t_{2})}{g_{n} (t_{2})} - \frac{s_{2, n} (s_{2})}{g_{n} (s_{2})} ∣ \overset{P}{\to} 0 .

Proof of Lemma 4. We first show that ${sup}_{t \in T} ∣ s_{0, n} (t_{2}) - f_{0} (t_{2}) ∣ = o_{P} (h_{n})$ . We have that

s_{0, n} (t_{2}) - f_{0} (t_{2}) = h_{n}^{- 1} \int K (\frac{a - t_{2}}{h_{n}}) f_{0} (a) d a - f_{0} (t_{2}) + n^{- 1 ∕ 2} h_{n}^{- 1} G_{n} K (\frac{\cdot - t_{2}}{h_{n}}) .

By the change of variables u = (a − t₂)/h_n, we have that

h_{n}^{- 1} \int K (\frac{a - t_{2}}{h_{n}}) f_{0} (a) d a - f_{0} (t_{2}) = \int K (u) [f_{0} (t_{2} + h_{n} u) - f_{0} (t_{2})] d u = h_{n} \int u K (u) (h_{n} u)^{- 1} R_{f}^{(1)} ((t_{1}, t_{2}), h_{n} u) d u,

which, in view of the assumed uniform negligibility of $R_{f}^{(1)}$ , tends to zero uniformly over t₂ faster than h_n. For the second term, since K is uniformly bounded and the class

{a \mapsto K (\frac{a - t_{2}}{h_{n}}) : t_{2} \in [0, 1]}

is P₀-Donsker, as implied by condition (M) of Stupfler (2016), Theorem 2.14.1 of VW implies that

sup_{t_{2}} ∣ G_{n} K (\frac{\cdot - t_{2}}{h_{n}}) ∣ = O_{P} (1) .

Then, since $n^{- 1 ∕ 2} h_{n}^{- 1} = h_{n} (n h_{n}^{4})^{- 1 ∕ 2} = o_{P} (h_{n})$ , this term is also o_P(h_n).

We next show that $(n h_{n}^{5})^{1 ∕ 2} {sup}_{t \in T} ∣ h_{n}^{- 2} s_{1, n} (t_{2}) - f_{0}^{'} (t_{2}) K_{2} ∣ = o_{P} (1)$ . We have that

(n h_{n})^{1 ∕ 2} s_{1, n} (t_{2}) = (n h_{n}^{- 1})^{1 ∕ 2} \int (a - t_{2}) K (\frac{a - t_{2}}{h_{n}}) f_{0} (a) d a + h_{n}^{- 1 ∕ 2} \iint (a - t_{2}) K (\frac{a - t_{2}}{h_{n}}) G_{n} (d y, d a) .

By the change of variables u = (a − t₂)/h_n, the first term equals

(n h_{n}^{3})^{1 ∕ 2} \int u K (u) f_{0} (t_{2} + h_{n} u) d u = (n h_{n}^{3})^{1 ∕ 2} \int u K (u) [f_{0} (t_{2} + h_{n} u) - f_{0} (t_{2}) - (h_{n} u) f_{0}^{'} (t_{2})] d u + (n h_{n}^{5})^{1 ∕ 2} f_{0}^{'} (t_{2}) K_{2} = (n h_{n}^{5})^{1 ∕ 2} \int u K (u) h_{n}^{- 1} R_{f}^{(1)} ((t_{1}, t_{2}), h_{n} u) d u + (n h_{n}^{5})^{1 ∕ 2} f_{0}^{'} (t_{2}) K_{2} .

By the assumed uniform negligibility of $R_{f}^{(1)}$ and since h_n = O(n^−1/5), the first term tends to zero in probability uniformly over $t \in T$ .

Turning to the second term in s_1,n(t₂), we will apply Theorem 2.14.1 of VW to obtain a tail bound for the supremum of this empirical process over the one-dimensional class indexed by t₂. We note that, since K is bounded by some $\bar{K}$ and supported on [−1, 1],

∣ (a - t_{2}) K (\frac{a - t_{2}}{h_{n}}) ∣ \leq \bar{K} ∣ a - t_{2} ∣ I (∣ a - t_{2} ∣ \leq h_{n}) \leq \bar{K} h_{n} .

Therefore, the class of functions

{(y, a) \mapsto (a - t_{2}) K (\frac{a - t_{2}}{h_{n}}) : (t_{1}, t_{2}) \in T}

has envelope $\bar{K} h_{n}$ . Furthermore, since (y, a) ↦ (a − t₂) and K are both uniformly bounded VC classes of functions, and K is bounded, the class of functions possesses finite entropy integral. Hence, we have that

E_{0} [sup_{(t_{1}, t_{2}) \in T} ∣ h_{n}^{- 1 ∕ 2} \iint (a - t_{2}) K (\frac{a - t_{2}}{h_{n}}) G_{n} (d y, d a) ∣] \leq C^{'} h_{n}^{1 ∕ 2} \to 0 .

We now have that $(n h_{n})^{1 ∕ 2} {sup}_{t \in T} ∣ s_{1, n} (t_{2}) - h_{n}^{2} f_{0}^{'} (t_{2}) K_{2} ∣ = o_{P} (1)$ , which implies in particular that

sup_{t \in T} ∣ s_{1, n} (t_{2}) ∣ = (n h_{n})^{- 1 ∕ 2} o_{P} (1) + h_{n}^{2} O_{P} (1) = O_{P} ((n h_{n})^{- 1 ∕ 2}) .
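The last equality in this display relies on the bandwidth condition $n h_{n}^{5} = O (1)$ of Lemma 4. Spelled out:

```latex
\[
h_n^{2} \;=\; (n h_n)^{-1/2}\,(n h_n^{5})^{1/2} \;=\; O\bigl((n h_n)^{-1/2}\bigr),
\]
```

so that both summands are of order (n h_n)^{−1/2}. For instance, under the bandwidth h_n = c n^{−1/5} (an illustrative choice that satisfies both conditions of Lemma 4), n h_n^5 = c^5 and (n h_n)^{−1/2} = c^{−1/2} n^{−2/5}.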

Next, we show that $(n h_{n}^{5})^{1 ∕ 2} {sup}_{t \in T} ∣ h_{n}^{- 2} s_{2, n} (t_{2}) - f_{0} (t_{2}) K_{2} ∣ = o_{P} (h_{n})$ . The proof of this is nearly identical to the preceding proof. We have that

(n h_{n})^{1 ∕ 2} s_{2, n} (t_{2}) = (n h_{n}^{- 1})^{1 ∕ 2} \int (a - t_{2})^{2} K (\frac{a - t_{2}}{h_{n}}) f_{0} (a) d a + h_{n}^{- 1 ∕ 2} \iint (a - t_{2})^{2} K (\frac{a - t_{2}}{h_{n}}) G_{n} (d y, d a) .

By the change of variables u = (a − t₂)/h_n, the first term equals

(n h_{n}^{5})^{1 ∕ 2} \int u^{2} K (u) f_{0} (t_{2} + h_{n} u) d u = (n h_{n}^{5})^{1 ∕ 2} h_{n} \int u^{3} K (u) \frac{R_{f}^{(1)} (t, h_{n} u)}{h_{n} u} d u + (n h_{n}^{5})^{1 ∕ 2} f_{0} (t_{2}) K_{2} .

By the uniform negligibility of $R_{f}^{(1)}$ , the first term is o_P(h_n) uniformly in t.

Analysis of the second term in s_2,n is analogous to that of s_1,n, except that the envelope function is now $\bar{K} h_{n}^{2}$ , so that the empirical process term is $O_{P} (h_{n}^{3 ∕ 2})$ . We also note that sup_t₂ ∣s_2,n(t₂)∣ = O_P ((nh_n)^−1/2).

The above derivations imply that

(n h_{n}^{5})^{1 ∕ 2} sup_{t \in T} ∣ h_{n}^{- 2} g_{n} (t_{2}) - f_{0} (t_{2})^{2} K_{2} ∣ \leq (n h_{n}^{5})^{1 ∕ 2} sup_{t \in T} ∣ [h_{n}^{- 2} s_{2, n} (t_{2}) - f_{0} (t_{2}) K_{2}] s_{0, n} (t_{2}) ∣ + (n h_{n})^{1 ∕ 2} {[sup_{t \in T} ∣ s_{1, n} (t_{2}) ∣]}^{2} + (n h_{n}^{5})^{1 ∕ 2} sup_{t \in T} ∣ [s_{0, n} (t_{2}) - f_{0} (t_{2})] f_{0} (t_{2}) K_{2} ∣ = o_{P} (1) O_{P} (1) + (n h_{n})^{1 ∕ 2} O_{P} ((n h_{n})^{- 1}) + (n h_{n}^{5})^{1 ∕ 2} o_{P} (h_{n}) = o_{P} (1) .

We now proceed to the statements in the lemma. We write that

∣ \frac{s_{1, n} (t_{2})}{g_{n} (t_{2})} - \frac{f_{0}^{'} (t_{2})}{f_{0} (t_{2})^{2}} ∣ = ∣ \frac{h_{n}^{- 2} s_{1, n} (t_{2})}{h_{n}^{- 2} g_{n} (t_{2})} - \frac{f_{0}^{'} (t_{2}) K_{2}}{f_{0} (t_{2})^{2} K_{2}} ∣ = ∣ \frac{h_{n}^{- 2} s_{1, n} (t_{2}) - f_{0}^{'} (t_{2}) K_{2}}{h_{n}^{- 2} g_{n} (t_{2})} - f_{0}^{'} (t_{2}) K_{2} \frac{h_{n}^{- 2} g_{n} (t_{2}) - f_{0} (t_{2})^{2} K_{2}}{h_{n}^{- 2} g_{n} (t_{2}) f_{0} (t_{2})^{2} K_{2}} ∣ \leq \frac{∣ h_{n}^{- 2} s_{1, n} (t_{2}) - f_{0}^{'} (t_{2}) K_{2} ∣}{h_{n}^{- 2} g_{n} (t_{2})} + ∣ f_{0}^{'} (t_{2}) ∣ K_{2} \frac{∣ h_{n}^{- 2} g_{n} (t_{2}) - f_{0} (t_{2})^{2} K_{2} ∣}{h_{n}^{- 2} g_{n} (t_{2}) f_{0} (t_{2})^{2} K_{2}} .

Since ${inf}_{t \in T} ∣ f_{0} (t_{2}) ∣ > 0$ , we have that

sup_{t \in T} [h_{n}^{- 2} g_{n} (t_{2})]^{- 1} = O_{P} (1) and sup_{t \in T} [h_{n}^{- 2} g_{n} (t_{2}) f_{0} (t_{2})^{2}]^{- 1} = O_{P} (1),

and the result follows.

We omit the proof of the statement about s_2,n, since it is almost identical to the above. For the statement about w_n, by the above calculations, we have that

(n h_{n}^{5})^{1 ∕ 2} sup_{(t_{1}, t_{2}) \in T} sup_{∣ a - t_{2} ∣ \leq h_{n}} ∣ h_{n}^{- 2} w_{n} (a, t_{2}) - w_{0} (a, t_{2}) K_{2} ∣ \overset{P}{\to} 0 .

We write that

∣ \frac{w_{n} (a, t_{2})}{g_{n} (t_{2})} - \frac{w_{0} (a, t_{2})}{f_{0} (t_{2})^{2}} ∣ = ∣ \frac{h_{n}^{- 2} w_{n} (a, t_{2})}{h_{n}^{- 2} g_{n} (t_{2})} - \frac{w_{0} (a, t_{2}) K_{2}}{f_{0} (t_{2})^{2} K_{2}} ∣ = ∣ \frac{h_{n}^{- 2} w_{n} (a, t_{2}) - w_{0} (a, t_{2}) K_{2}}{h_{n}^{- 2} g_{n} (t_{2})} - w_{0} (a, t_{2}) \frac{h_{n}^{- 2} g_{n} (t_{2}) - f_{0} (t_{2})^{2} K_{2}}{h_{n}^{- 2} g_{n} (t_{2}) f_{0} (t_{2})^{2}} ∣ \leq [h_{n}^{- 2} g_{n} (t_{2})]^{- 1} ∣ h_{n}^{- 2} w_{n} (a, t_{2}) - w_{0} (a, t_{2}) K_{2} ∣ + ∣ w_{0} (a, t_{2}) ∣ [h_{n}^{- 2} g_{n} (t_{2}) f_{0} (t_{2})^{2}]^{- 1} ∣ h_{n}^{- 2} g_{n} (t_{2}) - f_{0} (t_{2})^{2} K_{2} ∣

and the result follows.

We note that the above implies that

sup_{∣ t_{2} - s_{2} ∣ \leq η} ∣ s_{1, n} (t_{2}) - s_{1, n} (s_{2}) ∣ \leq 2 sup_{t_{2}} ∣ s_{1, n} (t_{2}) - h_{n}^{2} f_{0}^{'} (t_{2}) K_{2} ∣ + h_{n}^{2} sup_{∣ t_{2} - s_{2} ∣ \leq η} ∣ f_{0}^{'} (t_{2}) - f_{0}^{'} (s_{2}) ∣ K_{2} \leq o_{P} ((n h_{n})^{- 1 ∕ 2}) + h_{n}^{2} η,

so that

sup_{∣ t_{2} - s_{2} ∣ \leq δ ∕ (n h_{n})^{1 ∕ 2}} ∣ s_{1, n} (t_{2}) - s_{1, n} (s_{2}) ∣ = o_{P} ((n h_{n})^{- 1 ∕ 2}) .

Similarly, we have that sup_{∣t₂−s₂∣≤η} ∣s_0,n(t₂) − s_0,n(s₂)∣ = o_P (h_n) and

sup_{∣ t_{2} - s_{2} ∣ \leq η} ∣ s_{2, n} (t_{2}) - s_{2, n} (s_{2}) ∣ = o_{P} (h_{n} (n h_{n})^{- 1 ∕ 2}) .

Therefore, we find that

sup_{‖ t - s ‖ \leq δ ∕ (n h_{n})^{1 ∕ 2}} ∣ g_{n} (t_{2}) - g_{n} (s_{2}) ∣ \leq sup_{‖ t - s ‖ \leq δ ∕ (n h_{n})^{1 ∕ 2}} ∣ [s_{0, n} (t_{2}) - s_{0, n} (s_{2})] s_{2, n} (s_{2}) ∣ + sup_{‖ t - s ‖ \leq δ ∕ (n h_{n})^{1 ∕ 2}} ∣ s_{0, n} (t_{2}) [s_{2, n} (t_{2}) - s_{2, n} (s_{2})] ∣ + sup_{‖ t - s ‖ \leq δ ∕ (n h_{n})^{1 ∕ 2}} ∣ [s_{1, n} (t_{2}) + s_{1, n} (s_{2})] [s_{1, n} (t_{2}) - s_{1, n} (s_{2})] ∣ ≲ o_{P} (h_{n}) O_{P} ((n h_{n})^{- 1 ∕ 2}) + O_{P} (1) o_{P} (h_{n} (n h_{n})^{- 1 ∕ 2}) = o_{P} (h_{n} (n h_{n})^{- 1 ∕ 2}) .

We can now write that

sup_{‖ t - s ‖ \leq δ ∕ (n h_{n})^{1 ∕ 2}} ∣ \frac{s_{1, n} (t_{2})}{g_{n} (t_{2})} - \frac{s_{1, n} (s_{2})}{g_{n} (s_{2})} ∣ \leq h_{n}^{- 2} sup_{‖ t - s ‖ \leq δ ∕ (n h_{n})^{1 ∕ 2}} ∣ \frac{s_{1, n} (t_{2}) - s_{1, n} (s_{2})}{h_{n}^{- 2} g_{n} (t_{2})} ∣ + h_{n}^{- 4} sup_{‖ t - s ‖ \leq δ ∕ (n h_{n})^{1 ∕ 2}} ∣ s_{1, n} (s_{2}) \frac{g_{n} (t_{2}) - g_{n} (s_{2})}{h_{n}^{- 2} g_{n} (t_{2}) h_{n}^{- 2} g_{n} (s_{2})} ∣ = h_{n}^{- 2} o_{P} ((n h_{n})^{- 1 ∕ 2}) + h_{n}^{- 4} O_{P} ((n h_{n})^{- 1 ∕ 2}) o_{P} (h_{n} (n h_{n})^{- 1 ∕ 2}) = o_{P} ((n h_{n}^{5})^{- 1 ∕ 2})

and

sup_{‖ t - s ‖ \leq δ ∕ (n h_{n})^{1 ∕ 2}} ∣ \frac{s_{2, n} (t_{2})}{g_{n} (t_{2})} - \frac{s_{2, n} (s_{2})}{g_{n} (s_{2})} ∣ \leq h_{n}^{- 2} sup_{‖ t - s ‖ \leq δ ∕ (n h_{n})^{1 ∕ 2}} ∣ \frac{s_{2, n} (t_{2}) - s_{2, n} (s_{2})}{h_{n}^{- 2} g_{n} (t_{2})} ∣ + h_{n}^{- 4} sup_{‖ t - s ‖ \leq δ ∕ (n h_{n})^{1 ∕ 2}} ∣ s_{2, n} (s_{2}) \frac{g_{n} (t_{2}) - g_{n} (s_{2})}{h_{n}^{- 2} g_{n} (t_{2}) h_{n}^{- 2} g_{n} (s_{2})} ∣ = h_{n}^{- 2} o_{P} (h_{n} (n h_{n})^{- 1 ∕ 2}) + h_{n}^{- 4} O_{P} ((n h_{n})^{- 1 ∕ 2}) o_{P} (h_{n} (n h_{n})^{- 1 ∕ 2}) = o_{P} ((n h_{n}^{4})^{- 1}) .

We can now prove Proposition 1.

Proof of Proposition 1. We define

m_{1, n} (t_{1}, t_{2}) ≔ h_{n}^{- 1} \iint [θ_{0} (t_{1}, a) - θ_{0} (t_{1}, t_{2})] \frac{w_{n} (a, t_{2})}{g_{n} (t_{2})} K (\frac{a - t_{2}}{h_{n}}) P_{n} (d y, d a),

m_{2, n} (t_{1}, t_{2}) ≔ h_{n}^{- 1} \iint [I (y \leq t_{1}) - θ_{0} (t_{1}, a)] \frac{w_{n} (a, t_{2})}{g_{n} (t_{2})} K (\frac{a - t_{2}}{h_{n}}) P_{n} (d y, d a) .

Then, we have that θ_n(t₁, t₂) − θ₀(t₁, t₂) = m_1,n(t₁, t₂) + m_2,n(t₁, t₂). We note that, since E₀ [I(Y ≤ t₁) ∣ A = a] = θ₀(t₁, a), (nh_n)^1/2m_2,n(t₁, t₂) equals

h_{n}^{- 1 ∕ 2} \iint [I (y \leq t_{1}) - θ_{0} (t_{1}, a)] \frac{w_{n} (a, t_{2})}{g_{n} (t_{2})} K (\frac{a - t_{2}}{h_{n}}) G_{n} (d y, d a) = h_{n}^{- 1 ∕ 2} [\frac{s_{2, n} (t_{2})}{g_{n} (t_{2})} G_{n} ν_{n, t} - \frac{s_{1, n} (t_{2})}{g_{n} (t_{2})} G_{n} (ℓ_{t} ν_{n, t})] .

Therefore, we can write that

(n h_{n})^{1 ∕ 2} [θ_{n} (t_{1}, t_{2}) - θ_{0} (t_{1}, t_{2})] - (n h_{n}^{5})^{1 ∕ 2} \frac{1}{2} θ_{0, t_{2}}^{″} (t_{1}, t_{2}) K_{2} - R_{n} (t_{1}, t_{2}) = (n h_{n})^{1 ∕ 2} m_{1, n} (t_{1}, t_{2}) - (n h_{n}^{5})^{1 ∕ 2} \frac{1}{2} θ_{0, t_{2}}^{″} (t_{1}, t_{2}) K_{2} .

We now proceed to analyze m_1,n. We have that (nh_n)^1/2 m_1,n(t₁, t₂) equals

(n h_{n}^{- 1})^{1 ∕ 2} \int [θ_{0} (t_{1}, a) - θ_{0} (t_{1}, t_{2})] \frac{w_{n} (a, t_{2})}{g_{n} (t_{2})} K (\frac{a - t_{2}}{h_{n}}) f_{0} (a) d a + h_{n}^{- 1 ∕ 2} \iint [θ_{0} (t_{1}, a) - θ_{0} (t_{1}, t_{2})] \frac{w_{n} (a, t_{2})}{g_{n} (t_{2})} K (\frac{a - t_{2}}{h_{n}}) G_{n} (d y, d a) .

The second term in m_1,n may be further decomposed as

h_{n}^{- 1 ∕ 2} \frac{s_{2, n} (t_{2})}{g_{n} (t_{2})} G_{n} γ_{t, n} - h_{n}^{- 1 ∕ 2} \frac{s_{1, n} (t_{2})}{g_{n} (t_{2})} G_{n} (ℓ_{t} γ_{t, n})

with $γ_{t, n} (y, a) ≔ [θ_{0} (t_{1}, a) - θ_{0} (t_{1}, t_{2})] K (\frac{a - t_{2}}{h_{n}})$ and ℓ_t(y, a) := a − t₂. By Lemma 4, we have that

sup_{t \in T} ∣ \frac{s_{2, n} (t_{2})}{g_{n} (t_{2})} ∣ = O_{P} ((n h_{n}^{5})^{- 1 ∕ 2}),

and similarly for s_1,n. We will use Theorem 2.14.2 of VW to obtain bounds for ${sup}_{t \in T} ∣ G_{n} γ_{t, n} ∣$ and ${sup}_{t \in T} ∣ G_{n} (ℓ_{t} γ_{t, n}) ∣$ . We first note that, since K is bounded and supported on [−1, 1] and θ₀ is Lipschitz on $T$ , ${sup}_{t \in T} ∣ γ_{t, n} ∣ ≲ h_{n}$ and ${sup}_{t \in T} ∣ ℓ_{t} γ_{t, n} ∣ ≲ h_{n}^{2}$ . These will be our envelope functions for these classes. Next, since K is Lipschitz, we have that

∣ γ_{t, n} - γ_{s, n} ∣ \leq ∣ [θ_{0} (t_{1}, a) - θ_{0} (s_{1}, a)] - [θ_{0} (t_{1}, t_{2}) - θ_{0} (s_{1}, s_{2})] ∣ K (\frac{a - s_{2}}{h_{n}}) + ∣ θ_{0} (t_{1}, a) - θ_{0} (t_{1}, t_{2}) ∣ ∣ K (\frac{a - t_{2}}{h_{n}}) - K (\frac{a - s_{2}}{h_{n}}) ∣ ≲ ∣ t_{1} - s_{1} ∣ + ‖ t - s ‖ + ∣ t_{2} - s_{2} ∣ h_{n}^{- 1} ≲ ‖ t - s ‖ h_{n}^{- 1} .

Therefore, by VW Theorem 2.7.11, we have

N_{[]} (2 ε h_{n}^{- 1}, \mathcal{G}_{n}, L_{2} (P_{0})) ≲ N (ε, T, ‖ \cdot ‖) ≲ ε^{- 2},

where $\mathcal{G}_{n} ≔ {γ_{t, n} : t \in T}$ . Thus, by Theorem 2.14.2 of VW,

sup_{t \in T} ∣ G_{n} γ_{t, n} ∣ ≲ \int_{0}^{1} [\log N_{[]} (ε h_{n}, \mathcal{G}_{n}, L_{2} (P_{0}))]^{1 ∕ 2} d ε h_{n} ≲ \int_{0}^{1} [- \log (ε h_{n}^{2})]^{1 ∕ 2} d ε h_{n} = h_{n}^{- 1} \int_{0}^{h_{n}^{2}} (- \log ε)^{1 ∕ 2} d ε ≲ h_{n}^{- 1} {h_{n}^{2} [\log (h_{n}^{- 2})]^{1 ∕ 2}} ≲ h_{n} (\log h_{n}^{- 1})^{1 ∕ 2},

where we have used the fact that $\int_{0}^{z} (\log x^{- 1})^{1 ∕ 2} d x ≲ z (\log z^{- 1})^{1 ∕ 2}$ for all z small enough. A similar argument applies to ${sup}_{t \in T} ∣ G_{n} (ℓ_{t} γ_{t, n}) ∣$ . We thus have that the second summand in m_1,n is bounded above up to a constant not depending on n and uniformly in t by

h_{n}^{- 1 ∕ 2} O_{P} ((n h_{n}^{5})^{- 1 ∕ 2}) h_{n} (\log h_{n}^{- 1})^{1 ∕ 2} = O_{P} ({(\frac{n h_{n}^{4}}{\log h_{n}^{- 1}})}^{- 1 ∕ 2}),

which is o_P (1) since $n h_{n}^{4} ∕ (\log h_{n}^{- 1}) \to \infty$ .
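The integral bound used in the entropy calculation above can be checked numerically. Integration by parts followed by the substitution x = exp(−v²) gives the closed form ∫₀^z (log x⁻¹)^{1/2} dx = z (log z⁻¹)^{1/2} + (π^{1/2}/2) erfc((log z⁻¹)^{1/2}); this closed form and the function names below are part of the sketch, not of the original argument.

```python
import math
import numpy as np

def entropy_integral(z):
    """Closed form of the integral of sqrt(log(1/x)) over (0, z), for 0 < z < 1."""
    a = math.sqrt(math.log(1.0 / z))
    return z * a + 0.5 * math.sqrt(math.pi) * math.erfc(a)

def entropy_integral_midpoint(z, m=200_000):
    """Midpoint-rule approximation of the same integral, as an independent check."""
    x = z * (np.arange(m) + 0.5) / m
    return float(np.sum(np.sqrt(np.log(1.0 / x))) * z / m)

for z in (0.3, 0.1, 0.01, 1e-4):
    exact = entropy_integral(z)
    ratio = exact / (z * math.sqrt(math.log(1.0 / z)))
    print(z, exact, entropy_integral_midpoint(z), ratio)
```

Since erfc(a) ≤ exp(−a²)/(a π^{1/2}) for a > 0, the ratio of the integral to z (log z⁻¹)^{1/2} is at most 1 + 1/(2 log z⁻¹), so the implied constant can be taken close to one for small z.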

By the change of variables u = (a − t₂)/h_n, the first term in m_1,n equals

(n h_{n})^{1 ∕ 2} \int [θ_{0} (t_{1}, t_{2} + h_{n} u) - θ_{0} (t_{1}, t_{2})] \frac{w_{n} (t_{2} + h_{n} u, t_{2})}{g_{n} (t_{2})} K (u) f_{0} (t_{2} + h_{n} u) d u = (n h_{n})^{1 ∕ 2} \int [R_{θ}^{(2)} (t, h_{n} u) + θ_{0, t_{2}}^{'} (t_{1}, t_{2}) (h_{n} u) + \frac{1}{2} θ_{0, t_{2}}^{″} (t_{1}, t_{2}) (h_{n} u)^{2}] \cdot [R_{f}^{(1)} (t_{2}, h_{n} u) + f_{0} (t_{2}) + f_{0}^{'} (t_{2}) (h_{n} u)] \frac{w_{n} (t_{2} + h_{n} u, t_{2})}{g_{n} (t_{2})} K (u) d u .

Expanding the product, this is equal to

(n h_{n})^{1 ∕ 2} \int (h_{n} u) [θ_{0, t_{2}}^{'} (t_{1}, t_{2}) + \frac{1}{2} θ_{0, t_{2}}^{″} (t_{1}, t_{2}) (h_{n} u)] [f_{0} (t_{2}) + f_{0}^{'} (t_{2}) h_{n} u] \cdot \frac{w_{n} (t_{2} + h_{n} u, t_{2})}{g_{n} (t_{2})} K (u) d u + (n h_{n})^{1 ∕ 2} \int R_{θ}^{(2)} (t, h_{n} u) [f_{0} (t_{2}) + f_{0}^{'} (t_{2}) h_{n} u] \frac{w_{n} (t_{2} + h_{n} u, t_{2})}{g_{n} (t_{2})} K (u) d u + (n h_{n})^{1 ∕ 2} \int R_{f}^{(1)} (t_{2}, h_{n} u) (h_{n} u) [θ_{0, t_{2}}^{'} (t_{1}, t_{2}) + \frac{1}{2} θ_{0, t_{2}}^{″} (t_{1}, t_{2}) h_{n} u] \cdot \frac{w_{n} (t_{2} + h_{n} u, t_{2})}{g_{n} (t_{2})} K (u) d u + (n h_{n})^{1 ∕ 2} \int R_{θ}^{(2)} (t, h_{n} u) R_{f}^{(1)} (t_{2}, h_{n} u) \frac{w_{n} (t_{2} + h_{n} u, t_{2})}{g_{n} (t_{2})} K (u) d u .

By the assumed negligibility of $R_{θ}^{(2)}$ and $R_{f}^{(1)}$ as well as Lemma 4, the second through fourth summands tend to zero in probability uniformly over $T$ . The first term equals

\int f_{0}^{'} (t_{2}) [θ_{0, t_{2}}^{'} (t_{1}, t_{2}) + \frac{1}{2} θ_{0, t_{2}}^{″} (t_{1}, t_{2}) (h_{n} u)] \cdot (n h_{n}^{5})^{1 ∕ 2} [\frac{w_{n} (t_{2} + h_{n} u, t_{2})}{g_{n} (t_{2})} - \frac{w_{0} (t_{2} + h_{n} u, t_{2})}{f_{0} (t_{2})^{2}}] u^{2} K (u) d u + (n h_{n}^{5})^{1 ∕ 2} \int f_{0}^{'} (t_{2}) [θ_{0, t_{2}}^{'} (t_{1}, t_{2}) + \frac{1}{2} θ_{0, t_{2}}^{″} (t_{1}, t_{2}) (h_{n} u)] \cdot \frac{w_{0} (t_{2} + h_{n} u, t_{2})}{f_{0} (t_{2})^{2}} u^{2} K (u) d u + (n h_{n}^{3})^{1 ∕ 2} \int f_{0} (t_{2}) [θ_{0, t_{2}}^{'} (t_{1}, t_{2}) + \frac{1}{2} θ_{0, t_{2}}^{″} (t_{1}, t_{2}) (h_{n} u)] \cdot \frac{w_{n} (t_{2} + h_{n} u, t_{2})}{g_{n} (t_{2})} u K (u) d u .

By Lemma 4, the first summand tends to zero uniformly over $T$ . By symmetry of K, the sum of the second and third summands simplifies to

(n h_{n}^{5})^{1 ∕ 2} \frac{1}{2} θ_{0, t_{2}}^{″} (t_{1}, t_{2}) K_{2} + (n h_{n}^{5})^{1 ∕ 2} [\frac{s_{2, n} (t_{2})}{g_{n} (t_{2})} - \frac{1}{f_{0} (t_{2})}] f_{0} (t_{2}) \frac{1}{2} θ_{0, t_{2}}^{″} (t_{1}, t_{2}) K_{2} - (n h_{n}^{5})^{1 ∕ 2} [\frac{s_{1, n} (t_{2})}{g_{n} (t_{2})} - \frac{f_{0}^{'} (t_{2})}{f_{0} (t_{2})^{2}}] θ_{0, t_{2}}^{'} (t_{1}, t_{2}) f_{0} (t_{2}) K_{2} .

Once again, the second and third summands tend to zero uniformly over $T$ by Lemma 4. We have now shown that

sup_{t \in T} ∣ (n h_{n})^{1 ∕ 2} m_{1, n} (t_{1}, t_{2}) - (n h_{n}^{5})^{1 ∕ 2} \frac{1}{2} θ_{0, t_{2}}^{″} (t_{1}, t_{2}) K_{2} ∣ \overset{P}{\to} 0,

which completes the proof. □
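Dividing the centering term by (n h_n)^{1/2} shows that Proposition 1 identifies a leading bias of (1/2) h_n² θ″_{0,t₂}(t₁, t₂) K₂. The sketch below checks this order of magnitude in a noiseless population version of the local linear fit, under assumptions made only for illustration: a uniform design density, the Epanechnikov kernel, and the smooth function m(a) = sin(2a) standing in for a ↦ θ₀(t₁, a).

```python
import numpy as np

def epanechnikov(u):
    """Symmetric kernel supported on [-1, 1] (assumed choice)."""
    return 0.75 * np.clip(1.0 - u ** 2, 0.0, None)

def trapezoid(f, x):
    """Composite trapezoidal rule for samples f on the grid x."""
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(x)))

u = np.linspace(-1.0, 1.0, 20001)
K = epanechnikov(u)
K2 = trapezoid(u ** 2 * K, u)          # second kernel moment K_2

def population_local_linear(m, t2, h):
    """Intercept of the kernel-weighted linear fit to m around t2 (uniform design)."""
    s0, s1, s2 = (trapezoid(u ** j * K, u) for j in range(3))
    r0 = trapezoid(K * m(t2 + h * u), u)
    r1 = trapezoid(u * K * m(t2 + h * u), u)
    return (s2 * r0 - s1 * r1) / (s0 * s2 - s1 ** 2)

def m(a):
    return np.sin(2.0 * a)

t2, h = 0.5, 0.05
bias = population_local_linear(m, t2, h) - float(m(t2))
leading = 0.5 * h ** 2 * (-4.0 * np.sin(2.0 * t2)) * K2   # (1/2) h^2 m''(t2) K_2
print(bias, leading)
```

In this setting the two printed numbers differ only through a higher-order term in h, which is consistent with the h_n² scaling of the bias in the proposition.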

Finally, we prove Proposition 2.

Proof of Proposition 2. Since $θ_{0, t_{2}}^{″}$ is uniformly continuous and $n h_{n}^{5} = O (1)$ ,

sup_{‖ t - s ‖ \leq δ ∕ r_{n}} ∣ (n h_{n}^{5})^{1 ∕ 2} \frac{1}{2} θ_{0, t_{2}}^{″} (t_{1}, t_{2}) K_{2} - (n h_{n}^{5})^{1 ∕ 2} \frac{1}{2} θ_{0, t_{2}}^{″} (s_{1}, s_{2}) K_{2} ∣ \overset{P}{\to} 0 .

Therefore, it only remains to show that ${sup}_{‖ t - s ‖ \leq δ ∕ r_{n}} ∣ R_{n} (t) - R_{n} (s) ∣ \overset{P}{\to} 0$ . Recalling that ℓ_t(y, a) := a − t₂ and

ν_{n, t} (y, a) ≔ [I (y \leq t_{1}) - θ_{0} (t_{1}, a)] K (\frac{a - t_{2}}{h_{n}}),

we have that R_n(t) − R_n(s) equals

[\frac{s_{2, n} (t_{2})}{g_{n} (t_{2})} - \frac{s_{2, n} (s_{2})}{g_{n} (s_{2})}] G_{n} ν_{n, t} - [\frac{s_{1, n} (t_{2})}{g_{n} (t_{2})} - \frac{s_{1, n} (s_{2})}{g_{n} (s_{2})}] G_{n} (ℓ_{t} ν_{n, t}) + \frac{s_{2, n} (s_{2})}{g_{n} (s_{2})} G_{n} (ν_{n, t} - ν_{n, s}) - \frac{s_{1, n} (s_{2})}{g_{n} (s_{2})} G_{n} (ℓ_{t} ν_{n, t} - ℓ_{s} ν_{n, s}) .

Focusing first on $G_{n} ν_{n, t}$ , we have $G_{n} ν_{n, t} = G_{n} ν_{n, t, 1} - G_{n} ν_{n, t, 2}$ for

ν_{n, t, 1} (y, a) = I (y \leq t_{1}) K (\frac{a - t_{2}}{h_{n}}) and ν_{n, t, 2} (y, a) = θ_{0} (t_{1}, a) K (\frac{a - t_{2}}{h_{n}}) .

The classes

{I (y \leq t_{1}) : t \in T} and {K (\frac{a - t_{2}}{h_{n}}) : t \in T}

are both uniformly bounded above and VC. Therefore, the uniform covering numbers of the class

{I (y \leq t_{1}) K (\frac{a - t_{2}}{h_{n}}) : t \in T}

are bounded up to a constant by ε^−V for some V < ∞, so that the uniform entropy integral satisfies

J (η, G_{n, 1}) ≲ η {(\log η^{- 1})}^{1 ∕ 2}

for all η small enough, where $G_{n, 1} ≔ {ν_{n, t, 1} : t \in T}$ . We also have P₀ (ν_n,t,1)² ≲ h_n for all $t \in T$ and all n large enough. Thus, Theorem 2.1 of van der Vaart and Wellner (2011) implies that

sup_{t \in T} ∣ G_{n} ν_{n, t, 1} ∣ ≲ h_{n}^{1 ∕ 2} {(\log h_{n}^{- 1})}^{1 ∕ 2} + n^{- 1 ∕ 2} \log h_{n}^{- 1} .

For $G_{n} ν_{n, t, 2}$ , we have that

∣ ν_{n, t, 2} (y, a) - ν_{n, s, 2} (y, a) ∣ ≲ ‖ t - s ‖ (1 + h_{n}^{- 1}) ≲ h_{n}^{- 1} ‖ t - s ‖

for all n large enough and all (y, a). We can therefore apply Theorem 2.7.11 of VW to conclude that $N_{[]} (2 ε h_{n}^{- 1}, G_{n, 2}, L_{2} (P_{0})) ≲ ε^{- 2}$ for all ε small enough, where $G_{n, 2} = {ν_{n, t, 2} : t \in T}$ , which implies that $N_{[]} (ε, G_{n, 2}, L_{2} (P_{0})) ≲ (ε h_{n})^{- 2}$ . Thus, we have that

J_{[]} (η, G_{n, 2}) ≲ η {[\log (η h_{n})^{- 1}]}^{1 ∕ 2} .

Since P₀ (ν_n,t,2)² ≲ h_n as well, by Lemma 3.4.2 of VW, we then have

E_{P_{0}} sup_{t \in T} ∣ G_{n} ν_{n, t, 2} ∣ ≲ h_{n}^{1 ∕ 2} {(\log h_{n}^{- 1})}^{1 ∕ 2} + n^{- 1 ∕ 2} \log h_{n}^{- 1} .

Combining these two bounds with the last statement of Lemma 4 yields

h_{n}^{- 1 ∕ 2} sup_{t \in T} ∣ [\frac{s_{2, n} (t_{2})}{g_{n} (t_{2})} - \frac{s_{2, n} (s_{2})}{g_{n} (s_{2})}] G_{n} ν_{n, t} ∣ ≲ h_{n}^{- 1 ∕ 2} o_{P} ((n h_{n}^{4})^{- 1}) O_{P} (h_{n}^{1 ∕ 2} (\log h_{n}^{- 1})^{1 ∕ 2} + n^{- 1 ∕ 2} \log h_{n}^{- 1}) = o_{P} (1) {[\frac{n h_{n}^{4}}{(\log h_{n}^{- 1})^{1 ∕ 2}}]}^{- 1} + o_{P} (1) {(n h_{n}^{11 ∕ 3})}^{- 3 ∕ 2} h_{n} \log h_{n}^{- 1} .

Both terms tend to zero.

The analysis for $G_{n} (ℓ_{t} ν_{n, t})$ is very similar. In this case, we have $P_{0} (ℓ_{t} ν_{n, t})^{2} ≲ h_{n}^{3}$ , so that, using the same approach as above, we get

E_{P_{0}} sup_{t \in T} ∣ G_{n} (ℓ_{t} ν_{n, t}) ∣ ≲ h_{n}^{3 ∕ 2} {(\log h_{n}^{- 1})}^{1 ∕ 2} + n^{- 1 ∕ 2} (\log h_{n}^{- 1})^{1 ∕ 2}

and therefore, in view of Lemma 4,

h_{n}^{- 1 ∕ 2} sup_{t \in T} ∣ [\frac{s_{1, n} (t_{2})}{g_{n} (t_{2})} - \frac{s_{1, n} (s_{2})}{g_{n} (s_{2})}] G_{n} (ℓ_{t} ν_{n, t}) ∣ ≲ h_{n}^{- 1 ∕ 2} o_{P} ((n h_{n}^{5})^{- 1 ∕ 2}) O_{P} (h_{n}^{3 ∕ 2} (\log h_{n}^{- 1})^{1 ∕ 2} + n^{- 1 ∕ 2} (\log h_{n}^{- 1})^{1 ∕ 2}) = o_{P} (1) (n h_{n}^{4})^{- 1 ∕ 2} (h_{n} \log h_{n}^{- 1})^{1 ∕ 2} + o_{P} (1) (n h_{n}^{3})^{- 1 ∕ 2} (h_{n} \log h_{n}^{- 1})^{1 ∕ 2},

which goes to zero in probability.

It remains to bound

sup_{‖ t - s ‖ < δ ∕ r_{n}} ∣ G_{n} (ν_{n, t} - ν_{n, s}) ∣ and sup_{‖ t - s ‖ < δ ∕ r_{n}} ∣ G_{n} (ℓ_{t} ν_{n, t} - ℓ_{s} ν_{n, s}) ∣ .

For the former, we work on the terms $G_{n} (ν_{n, t, 1} - ν_{n, s, 1})$ and $G_{n} (ν_{n, t, 2} - ν_{n, s, 2})$ separately. For the first of these, we let $F_{n, δ, 1} ≔ {ν_{n, t, 1} - ν_{n, s, 1} : ‖ t - s ‖ \leq δ ∕ r_{n}}$ . We have that ∥ν_n,t,1 − ν_n,s,1∥_P₀,2 is bounded above by

{[E_{P_{0}} {[I (Y \leq t_{1}) - I (Y \leq s_{1})]^{2} K {(\frac{A - s_{2}}{h_{n}})}^{2}}]}^{1 ∕ 2} + {[E_{P_{0}} {I (Y \leq t_{1}) {[K (\frac{A - t_{2}}{h_{n}}) - K (\frac{A - s_{2}}{h_{n}})]}^{2}}]}^{1 ∕ 2} ≲ (E_{P_{0}} {I (s_{1} < Y \leq t_{1}) I (∣ A - s_{2} ∣ \leq h_{n})})^{1 ∕ 2} + h_{n}^{- 1} ∣ t_{2} - s_{2} ∣ ≲ h_{n}^{1 ∕ 2} ∣ t_{1} - s_{1} ∣^{1 ∕ 2} + h_{n}^{- 1} ∣ t_{2} - s_{2} ∣ .

Therefore, it follows that

sup_{f \in F_{n, δ, 1}} (P_{0} f^{2})^{1 ∕ 2} ≲ (n h_{n}^{- 1})^{- 1 ∕ 4} + (n h_{n}^{3})^{- 1 ∕ 2} ≲ (n h_{n}^{3})^{- 1 ∕ 2}

for all n large enough. In addition, $F_{n, δ, 1}$ has uniform covering numbers bounded up to a constant by ε^−V for all n and δ because the classes { $I (y \leq t_{1}) : t \in T$ } and

{K (\frac{a - t_{2}}{h_{n}}) : t \in T}

are VC. Therefore, $J (η, F_{n, δ, 1}) ≲ η (\log η^{- 1})^{1 ∕ 2}$ for all η small enough. Thus, Theorem 2.1 of van der Vaart and Wellner (2011) implies that

E_{P_{0}} sup_{‖ t - s ‖ \leq δ ∕ r_{n}} ∣ G_{n} (ν_{n, t, 1} - ν_{n, s, 1}) ∣ ≲ {[\frac{\log (n h_{n}^{3})}{n h_{n}^{3}}]}^{1 ∕ 2} + \frac{\log (n h_{n}^{3})}{n^{1 ∕ 2}} .

Turning to $G_{n} (ν_{n, t, 2} - ν_{n, s, 2})$ , we analogously define

F_{n, δ, 2} ≔ {ν_{n, t, 2} - ν_{n, s, 2} : ‖ t - s ‖ \leq δ ∕ r_{n}} .

By the Lipschitz property of θ₀ and K, we have that

∣ θ_{0} (t_{1}, a) K (\frac{a - t_{2}}{h_{n}}) - θ_{0} (s_{1}, a) K (\frac{a - s_{2}}{h_{n}}) ∣ ≲ h_{n}^{- 1} ‖ t - s ‖ .

Therefore, up to a constant, an envelope function F_n,δ,2 for $F_{n, δ, 2}$ is given by $h_{n}^{- 1} δ ∕ r_{n} ≲ (n h_{n}^{3})^{- 1 ∕ 2}$ . Next, we have, for any (t, s) and (t′, s′) in $T^{2}$ ,

∣ [θ_{0} (t_{1}, a) K (\frac{a - t_{2}}{h_{n}}) - θ_{0} (s_{1}, a) K (\frac{a - s_{2}}{h_{n}})] - [θ_{0} (t_{1}^{'}, a) K (\frac{a - t_{2}^{'}}{h_{n}}) - θ_{0} (s_{1}^{'}, a) K (\frac{a - s_{2}^{'}}{h_{n}})] ∣ \leq ∣ θ_{0} (t_{1}, a) - θ_{0} (t_{1}^{'}, a) ∣ K (\frac{a - t_{2}^{'}}{h_{n}}) + ∣ θ_{0} (t_{1}, a) ∣ ∣ K (\frac{a - t_{2}}{h_{n}}) - K (\frac{a - t_{2}^{'}}{h_{n}}) ∣ + ∣ θ_{0} (s_{1}, a) - θ_{0} (s_{1}^{'}, a) ∣ K (\frac{a - s_{2}^{'}}{h_{n}}) + ∣ θ_{0} (s_{1}, a) ∣ ∣ K (\frac{a - s_{2}}{h_{n}}) - K (\frac{a - s_{2}^{'}}{h_{n}}) ∣ ≲ ∣ t_{1} - t_{1}^{'} ∣ + h_{n}^{- 1} ∣ t_{2} - t_{2}^{'} ∣ + ∣ s_{1} - s_{1}^{'} ∣ + h_{n}^{- 1} ∣ s_{2} - s_{2}^{'} ∣ ≲ h_{n}^{- 1} ‖ (t, s) - (t^{'}, s^{'}) ‖_{T^{2}}

with $‖ (t, s) - (t^{'}, s^{'}) ‖_{T^{2}} ≔ \max {‖ t - t^{'} ‖, ‖ s - s^{'} ‖}$ . Therefore, by Theorem 2.7.11 of VW, we have that

N_{[]} (2 ε h_{n}^{- 1}, F_{n, δ, 2}, L_{2} (P_{0})) \leq N (ε, U_{δ ∕ r_{n}}, ‖ \cdot ‖_{T^{2}}),

where $U_{δ ∕ r_{n}} ≔ {(t, s) \in T^{2} : ‖ t - s ‖ \leq δ ∕ r_{n}}$ . Since $U_{δ ∕ r_{n}} \subseteq T^{2}$ , we trivially have that $N (ε, U_{δ ∕ r_{n}}, ‖ \cdot ‖_{T^{2}}) ≲ ε^{- 4}$ . Thus, it follows that

N_{[]} (ε (n h_{n}^{3})^{- 1 ∕ 2}, F_{n, δ, 2}, L_{2} (P_{0})) ≲ {[ε (n h_{n})^{- 1 ∕ 2}]}^{- 4} .

Therefore, Theorem 2.14.2 of VW implies that

E_{P_{0}} sup_{‖ t - s ‖ \leq δ ∕ r_{n}} ∣ G_{n} (ν_{n, t, 2} - ν_{n, s, 2}) ∣ ≲ (n h_{n}^{3})^{- 1 ∕ 2} \int_{0}^{1} {\log {[ε (n h_{n})^{- 1 ∕ 2}]}^{- 1}}^{1 ∕ 2} d ε = (n h_{n}^{3})^{- 1 ∕ 2} (n h_{n})^{1 ∕ 2} \int_{0}^{(n h_{n})^{- 1 ∕ 2}} (\log u^{- 1})^{1 ∕ 2} d u ≲ (n h_{n}^{3})^{- 1 ∕ 2} [\log (n h_{n})]^{1 ∕ 2} .

We now have that

h_{n}^{- 1 ∕ 2} sup_{‖ t - s ‖ \leq δ ∕ r_{n}} ∣ \frac{s_{2, n} (s_{2})}{g_{n} (s_{2})} G_{n} (ν_{n, t} - ν_{n, s}) ∣ ≲ h_{n}^{- 1 ∕ 2} O_{P} ((n h_{n}^{5})^{- 1 ∕ 2}) O_{P} ((n h_{n}^{3})^{- 1 ∕ 2} [\log (n h_{n}^{3})]^{1 ∕ 2} + n^{- 1 ∕ 2} \log (n h_{n}^{3})) = O_{P} (1) {(n h_{n}^{9 ∕ 2})^{- 1} [\log (n h_{n}^{3})]^{1 ∕ 2} + (n h_{n}^{3})^{- 1} \log (n h_{n}^{3})} .

Both terms tend to zero in probability.

We can address ${sup}_{‖ t - s ‖ < δ ∕ r_{n}} ∣ G_{n} (ℓ_{t} ν_{n, t} - ℓ_{s} ν_{n, s}) ∣$ in a very similar manner. As before, we work on terms $G_{n} (ℓ_{t} ν_{n, t, 1} - ℓ_{s} ν_{n, s, 1})$ and $G_{n} (ℓ_{t} ν_{n, t, 2} - ℓ_{s} ν_{n, s, 2})$ separately. It is straightforward to see that the same line of reasoning as used above applies to each of these terms as well, yielding the same negligibility. □

Contributor Information

Ted Westling, Department of Mathematics and Statistics, University of Massachusetts Amherst, Amherst, Massachusetts, USA.

Mark J. van der Laan, Division of Biostatistics, University of California, Berkeley, Berkeley, California, USA.

Marco Carone, Department of Biostatistics, University of Washington, Seattle, Washington, USA.

References

Ayer M, Brunk HD, Ewing GM, Reid WT and Silverman E (1955). An Empirical Distribution Function for Sampling with Incomplete Information. Ann. Math. Statist 26 641–647.
Barlow RE, Bartholomew DJ, Bremner JM and Brunk HD (1972). Statistical Inference Under Order Restrictions: The Theory and Application of Isotonic Regression. Wiley, New York.
Bril G, Dykstra R, Pillers C and Robertson T (1984). Algorithm AS 206: Isotonic Regression in Two Independent Variables. J. R. Stat. Soc. Ser. C. Appl. Stat 33 352–357.
Chernozhukov V, Fernández-Val I and Galichon A (2010). Quantile and Probability Curves Without Crossing. Econometrica 78 1093–1125.
Chernozhukov V, Fernández-Val I and Galichon A (2009). Improving point and interval estimators of monotone functions by rearrangement. Biometrika 96 559–575.
Daouia A and Park BU (2013). On Projection-type Estimators of Multivariate Isotonic Functions. Scandinavian Journal of Statistics 40 363–386.
Dette H, Neumeyer N and Pilz KF (2006). A simple nonparametric estimator of a strictly monotone regression function. Bernoulli 12 469–490.
Fan J and Gijbels I (1996). Local Polynomial Modelling and Its Applications. CRC Press, Boca Raton.
Fokianos K, Leucht A and Neumann MH (2017). On Integrated L¹ Convergence Rate of an Isotonic Regression Estimator for Multivariate Observations. arXiv e-prints arXiv:1710.04813.
Gill RD and Robins JM (2001). Causal Inference for Complex Longitudinal Data: The Continuous Case. Ann. Statist 29 1785–1811.
Hall P and Kang K-H (2001). Bootstrapping nonparametric density estimators with empirically chosen bandwidths. Ann. Statist 29 1443–1468.
Härdle W, Janssen P and Serfling R (1988). Strong Uniform Consistency Rates for Estimators of Conditional Functionals. Ann. Statist 16 1428–1449.
Kyng R, Rao A and Sachdeva S (2015). Fast, Provable Algorithms for Isotonic Regression in all ℓp-norms. In Advances in Neural Information Processing Systems 28 (Cortes C, Lawrence ND, Lee DD, Sugiyama M and Garnett R, eds.) 2719–2727. Curran Associates, Inc.
Liao X and Meyer MC (2014). coneproj: An R Package for the Primal or Dual Cone Projections with Routines for Constrained Regression. Journal of Statistical Software 61 1–22.
Meyer MC (1999). An extension of the mixed primal–dual bases algorithm to the case of more constraints than dimensions. Journal of Statistical Planning and Inference 81 13–31.
Mukarjee H and Stern S (1994). Feasible Nonparametric Estimation of Multiargument Monotone Functions. Journal of the American Statistical Association 89 77–80.
Patra RK and Sen B (2016). Estimation of a two-component mixture model with applications to multiple testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78 869–893.
Robertson T, Wright F and Dykstra R (1988). Order Restricted Statistical Inference. Wiley, New York.
Robins J (1986). A new approach to causal inference in mortality studies with a sustained exposure period – application to control of the healthy worker survivor effect. Mathematical Modelling 7 1393–1512.
Ruppert D, Sheather SJ and Wand MP (1995). An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association 90 1257–1270.
Stupfler G (2016). On the weak convergence of the kernel density estimator in the uniform topology. Electron. Commun. Probab 21 13 pp.
R Core Team (2018). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Turner R (2015). Iso: Functions to Perform Isotonic Regression. R package version 0.0–17.
van der Laan MJ and Robins JM (2003). Unified methods for censored longitudinal data and causality. Springer Science & Business Media.
van der Vaart AW and Wellner JA (1996). Weak Convergence and Empirical Processes. Springer-Verlag, New York.
van der Vaart A and Wellner JA (2011). A local maximal inequality under uniform entropy. Electron. J. Statist 5 192–203.
Wand M (2015). KernSmooth: Functions for Kernel Smoothing Supporting Wand & Jones (1995). R package version 2.23-15.
Wand MP and Jones MC (1995). Kernel Smoothing. Chapman and Hall, London.