Author manuscript; available in PMC 2021 May 11. Published in final edited form as: Electron J Stat. 2020 Aug 15;14(2):3032–3069. doi: 10.1214/20-ejs1740

Correcting an estimator of a multivariate monotone function with isotonic regression

Ted Westling 1, Mark J van der Laan 2, Marco Carone 3
PMCID: PMC8112587  NIHMSID: NIHMS1624095  PMID: 33981382

Abstract

In many problems, a sensible estimator of a possibly multivariate monotone function may fail to be monotone. We study the correction of such an estimator obtained via projection onto the space of functions monotone over a finite grid in the domain. We demonstrate that this corrected estimator has no worse supremal estimation error than the initial estimator, and that analogously corrected confidence bands contain the true function whenever the initial bands do, at no loss to band width. Additionally, we demonstrate that the corrected estimator is asymptotically equivalent to the initial estimator if the initial estimator satisfies a stochastic equicontinuity condition and the true function is Lipschitz and strictly monotone. We provide simple sufficient conditions in the special case that the initial estimator is asymptotically linear, and illustrate the use of these results for estimation of a G-computed distribution function. Our stochastic equicontinuity condition is weaker than standard uniform stochastic equicontinuity, which has been required for alternative correction procedures. This allows us to apply our results to the bivariate correction of the local linear estimator of a conditional distribution function known to be monotone in its conditioning argument. Our experiments suggest that the projection step can yield significant practical improvements.

MSC2020 subject classifications: Primary 62G20, secondary 60G15

Keywords: Asymptotic linearity, confidence band, kernel smoothing, projection, shape constraint, stochastic equicontinuity

1. Introduction

1.1. Background

In many scientific problems, the parameter of interest is a component-wise monotone function. In practice, an estimator of this function may have several desirable statistical properties, yet fail to be monotone. This often occurs when the estimator is obtained through the pointwise application of a statistical procedure over the domain of the function. For instance, we may be interested in estimating a conditional cumulative distribution function θ0, defined pointwise as θ0(a, y) := P0(Y ≤ y ∣ A = a), over its domain D ⊆ R2. Here, Y may represent an outcome and A an exposure. The map y ↦ θ0(a, y) is necessarily monotone for each fixed a. In some scientific contexts, it may be known that a ↦ θ0(a, y) is also monotone for each y, in which case θ0 is a bivariate component-wise monotone function. An estimator of θ0 can be constructed by estimating the regression function (a, y) ↦ EP0[I(Y ≤ y) ∣ A = a] for each (a, y) on a finite grid using kernel smoothing, and performing suitable interpolation elsewhere. For some types of kernel smoothing, including the Nadaraya-Watson estimator, the resulting estimator is necessarily monotone as a function of y for each value of a, but not necessarily monotone as a function of a for each value of y. For other types of kernel smoothing, including the local linear estimator, which often has smaller asymptotic bias than the Nadaraya-Watson estimator, the resulting estimator need not be monotone in either component.

Whenever the function of interest is component-wise monotone, failure of an estimator to itself be monotone can be problematic. This is most apparent if the monotonicity constraint is probabilistic in nature – that is, the parameter mapping is monotone under all possible probability distributions. This is the case, for instance, if θ0 is a distribution function. In such settings, returning a function estimate that fails to be monotone is nonsensical, like reporting a probability estimate outside the interval [0, 1]. However, even if the monotonicity constraint is based on scientific knowledge rather than probabilistic constraints, failure of an estimator to be monotone can be an issue. For example, if the parameter of interest represents average height or weight among children as a function of age, scientific collaborators would likely be unsatisfied if presented with an estimated curve that was not monotone. Finally, as we will see, there are often finite-sample performance benefits to ensuring that the monotonicity constraint is respected.

Whenever this phenomenon occurs, it is natural to seek an estimator that respects the monotonicity constraint but nevertheless remains close to the initial estimator, which may otherwise have good statistical properties. A monotone estimator can be naturally constructed by projecting the initial estimator onto the space of monotone functions with respect to some norm. A common choice is the L2-norm, which amounts to using multivariate isotonic regression to correct the initial estimator.

1.2. Contribution and organization of the article

In this article, we discuss correcting an initial estimator of a multivariate monotone function by computing the isotonic regression of the estimator over a finite grid in the domain, and interpolating between grid points. We also consider correcting an initial confidence band by using the same procedure applied to the upper and lower limits of the band. We provide three general results regarding this simple procedure.

  1. Building on the results of Robertson, Wright and Dykstra (1988) and Chernozhukov, Fernández-Val and Galichon (2009), we demonstrate that the corrected estimator is at least as good as the initial estimator, meaning:
    1. its uniform error over the grid used in defining the projection is less than or equal to that of the initial estimator for every sample;
    2. its uniform error over the entire domain is less than or equal to that of the initial estimator asymptotically;
    3. the corrected confidence band contains the true function on the projection grid whenever the initial band does, at no cost in terms of average or uniform band width.
  2. We provide high-level sufficient conditions under which the uniform difference between the initial and corrected estimators is OP(rn1) for a generic sequence rn → ∞.

  3. We provide simpler lower-level sufficient conditions in two special cases:
    1. when the initial estimator is uniformly asymptotically linear, in which case the appropriate rate is rn = n1/2;
    2. when the initial estimator is kernel-smoothed with bandwidth hn, in which case the appropriate rate is rn = (nhn)1/2 for univariate kernel smoothing.

We apply our theoretical results to two sets of examples: nonparametric efficient estimation of a G-computed distribution function for a binary exposure, and local linear estimation of a conditional distribution function with a continuous exposure.

Other authors have considered the correction of an initial estimator using isotonic regression. To name a few, Mukerjee and Stern (1994) used a projection-like procedure applied to a kernel smoothing estimator of a regression function, whereas Patra and Sen (2016) used the projection procedure applied to a univariate cumulative distribution function in the context of a mixture model. These articles addressed the properties of the projection procedure in their specific applications. In contrast, we provide general results that are applicable broadly.

1.3. Alternative projection procedures

The projection approach is not the only possible correction procedure. Dette, Neumeyer and Pilz (2006), Chernozhukov, Fernández-Val and Galichon (2009), and Chernozhukov, Fernández-Val and Galichon (2010) studied a correction based on monotone rearrangements. However, monotone rearrangements do not generalize to the multivariate setting as naturally as projections — for example, Chernozhukov, Fernández-Val and Galichon (2009) proposed averaging a variety of possible multivariate monotone rearrangements to obtain a final monotone estimator. In contrast, the L2 projection of an initial estimator onto the space of component-wise monotone functions is uniquely defined, even in the context of multivariate functions.

Daouia and Park (2013) proposed an alternative correction procedure that consists of taking a convex combination of upper and lower monotone envelope functions, and they demonstrated conditions under which their estimator is asymptotically equivalent in supremum norm to the initial estimator. There are several differences between our contributions and those of Daouia and Park (2013). For instance, Daouia and Park (2013) did not study correction of confidence bands, which we consider in Section 2.3, or the important special case of asymptotically linear estimators, which we consider in Section 3.1. Our results in these two sections apply equally well to our correction procedure and to the correction procedure considered by Daouia and Park (2013).

Perhaps the most important theoretical contribution of our work beyond that of existing research is the weaker form of stochastic equicontinuity that we require for establishing asymptotic equivalence of the initial and projected estimators. In contrast, Daouia and Park (2013) explicitly required the usual uniform asymptotic equicontinuity, while application of the Hadamard differentiability results of Chernozhukov, Fernández-Val and Galichon (2010) requires weak convergence to a tight limit, which is stronger than uniform asymptotic equicontinuity. Our weaker condition allows us to use our general results to tackle a broader range of initial estimators, including kernel smoothed estimators, which are typically not uniformly asymptotically equicontinuous at useful rates, but nevertheless can frequently be shown to satisfy our condition. We discuss this in detail in Section 3.2. We illustrate this general contribution in Section 4.2 by studying the bivariate correction of a conditional distribution function estimated using local linear regression, which would not be possible using the stronger asymptotic equicontinuity condition. In numerical studies, we find that the projected estimator and confidence bands can offer substantial finite-sample improvements over the initial estimator and bands in this example.

2. Main results

2.1. Definitions and statistical setup

Let M be a statistical model, that is, a collection of probability measures on a measurable space (X, B). Let θ : M → ℓ∞(T) be a parameter of interest on M, where T = [0, 1]d and ℓ∞(T) is the Banach space of bounded functions from T to R equipped with the supremum norm ∥·∥T. We have specified this particular T for simplicity, but the results established here apply to any bounded rectangular domain T ⊂ Rd. For each P ∈ M, denote by θP the evaluation of θ at P, and note that θP is a bounded real-valued function on T. For any t ∈ T, denote by θP(t) ∈ R the evaluation of θP at t.

For any vector t ∈ Rd and 1 ≤ j ≤ d, denote by tj the jth component of t. Define the partial order ≤ on Rd by setting t ≤ t′ if and only if tj ≤ t′j for each 1 ≤ j ≤ d. A function f : Rd → R is called (component-wise) monotone non-decreasing if t ≤ t′ implies that f(t) ≤ f(t′). Denote ∥t∥ := max1≤j≤d ∣tj∣ for any vector t ∈ Rd. Additionally, denote by Θ ⊂ ℓ∞(T) the convex set of bounded monotone non-decreasing functions from T to R. For concreteness, we focus on non-decreasing functions, but all results established here apply equally to non-increasing functions.

Let M0 := {P ∈ M : θP ∈ Θ} ⊆ M and suppose that M0 is nonempty. Generally, this inclusion is strict only if, rather than being implied by the rules of probability, the monotonicity constraint stems at least in part from prior scientific knowledge. Also, define Θ0 := {θ ∈ Θ : θ = θP for some P ∈ M} ⊆ Θ. We are primarily interested in settings where Θ0 = Θ, since in this case there is no additional knowledge about θ encoded by M, and in particular there is no danger of yielding a corrected estimator that is compatible with no P ∈ M.

Suppose that observations X1, X2, …, Xn are sampled independently from an unknown distribution P0 ∈ M0, and that we wish to estimate θ0 := θP0 based on these observations. Suppose that, for each t ∈ T, we have access to an estimator θn(t) of θ0(t) based on X1, X2, …, Xn. We note that the assumption that the data are independent and identically distributed is not necessary for Theorems 1 and 2 below. For any suitable f : X → R, we define Pf := ʃ f(x) P(dx) and Gnf := n1/2 ʃ f(x)(Pn − P0)(dx), where Pn is the empirical distribution based on X1, X2, …, Xn.

The central premise of this article is that θn(t) may have desirable statistical properties for each t or even uniformly in t, but that θn as an element of ℓ∞(T) may not fall in Θ for any finite n, or even with probability tending to one. Our goal is to provide a corrected estimator θn* that necessarily falls in Θ, and yet retains the statistical properties of θn. A natural way to accomplish this is to define θn* as the closest element of Θ to θn in some norm on T. Ideally, we would prefer to take θn* to minimize ∥θ − θn∥T over θ ∈ Θ. However, this is not tractable for two reasons. First, optimization over the entirety of T is an infinite-dimensional optimization problem, and is hence frequently computationally intractable. To resolve this issue, for each n, we let Tn = {t1, t2, …, tmn} ⊂ T be a finite rectangular lattice in T over which we will perform the optimization, and denote by ∥·∥Tn the supremum norm over Tn. While it is now computationally feasible to define θn,∞* as a minimizer over θ ∈ Θ of the finite-dimensional objective function ∥θ − θn∥Tn, this objective function is challenging to optimize due to its non-differentiability. Instead, we define

θn* := argminθ∈Θ Σ_{t∈Tn} [θ(t) − θn(t)]2. (2.1)

The squared-error objective function is smooth in its arguments. In dimension d = 1, θn* thus defined is simply the isotonic regression of θn on the grid Tn, which has a closed-form representation as the greatest convex minorant of the so-called cumulative sum diagram. Furthermore, since ∥θn,∞* − θn∥Tn ≤ ∥θn* − θn∥Tn by the definition of θn,∞*, many of our results also apply to θn,∞*.

We note that θn* is only uniquely defined on Tn. To completely characterize θn*, we must monotonically interpolate function values between elements of Tn. We will permit any monotonic interpolation that satisfies a weak condition. By the definition of a rectangular lattice, every t ∈ T can be assigned a hyper-rectangle whose vertices {s1, s2, …, s2d} are elements of Tn and whose interior has empty intersection with Tn. If multiple such hyper-rectangles exist for t, such as when t lies on the boundary of two or more such hyper-rectangles, one can be assigned arbitrarily. We will assume that, for t ∉ Tn,

θn*(t) = Σk λk,n(t) θn*(sk)

for weights λ1,n(t), λ2,n(t), …, λ2d,n(t) ∈ [0, 1] such that Σk λk,n(t) = 1. In words, we assume that θn*(t) is a convex combination of the values of θn* on the vertices of the hyper-rectangle containing t. A simple interpolation approach consists of setting θn*(t) = θn*(t′) with t′ the element of Tn closest to t, choosing any such element if there are multiple elements of Tn equally close to t. This particular scheme satisfies our requirement.

Finally, for each n, we let ℓn(t) ≤ un(t) denote the lower and upper endpoints of a confidence band for θ0(t). We then define ℓn* and un* as the corrected versions of ℓn and un, obtained using the same projection and interpolation procedure defined above for obtaining θn* from θn.

In dimension d = 1, θn*(t), ℓn*(t), and un*(t) can be obtained for t ∈ Tn via the Pool Adjacent Violators Algorithm (Ayer et al., 1955), as implemented in the R command isoreg (R Core Team, 2018). In dimension d = 2, the corrections can be obtained using the algorithm described in Bril et al. (1984), which is implemented in the R command biviso in the package Iso (Turner, 2015). In dimension d ≥ 3, Kyng, Rao and Sachdeva (2015) provide algorithms for computing the isotonic regression based on embedding the points in a directed acyclic graph. Alternatively, general-purpose algorithms for minimization of quadratic criteria over convex cones have been developed and implemented in the R package coneproj and may be used in this case (Meyer, 1999; Liao and Meyer, 2014).
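To fix ideas, the following minimal R sketch illustrates the projection step and the interpolation scheme described above; the initial estimate theta_hat is a toy example of ours, not the paper's, and the band offsets are purely illustrative.

```r
# Projection of a (possibly non-monotone) initial estimate onto the
# monotone cone over a grid, in dimensions d = 1 and d = 2.
library(Iso)  # for biviso()

# d = 1: isotonic regression via the Pool Adjacent Violators Algorithm.
t_grid    <- seq(0, 1, length.out = 100)
theta_hat <- pnorm(t_grid, 0.5, 0.2) + rnorm(100, sd = 0.03)  # toy initial estimate
theta_cor <- isoreg(t_grid, theta_hat)$yf                     # projected estimate

# Band endpoints are corrected with the same operation (offsets illustrative).
l_cor <- isoreg(t_grid, theta_hat - 0.05)$yf
u_cor <- isoreg(t_grid, theta_hat + 0.05)$yf

# d = 2: bivariate isotonic regression over a rectangular lattice, with the
# initial estimates stored as a matrix indexed by the two grid coordinates.
theta_hat_2d <- outer(seq(0, 1, length.out = 20),
                      seq(0, 1, length.out = 20), "+") / 2 + rnorm(400, sd = 0.03)
theta_cor_2d <- biviso(theta_hat_2d)

# Nearest-neighbor interpolation off the grid, a special case of the
# convex-combination interpolation described above.
theta_cor_fun <- function(t) theta_cor[which.min(abs(t_grid - t))]
```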

2.2. Properties of the projected estimator

The projected estimator θn* is the isotonic regression of θn over the grid Tn. Hence, many existing finite-sample results on isotonic regression can be used to deduce properties of θn*. Theorem 1 below collects a few of these properties, building upon the results of Barlow et al. (1972) and Chernozhukov, Fernández-Val and Galichon (2009). We denote by ωn := sup_{t∈T} min_{s∈Tn} ∥t − s∥ the mesh of Tn in T.

Theorem 1. (i) It holds that ∥θn* − θ0∥Tn ≤ ∥θn − θ0∥Tn.

(ii) If ωn = oP(1) and θ0 is continuous on T, then

∥θn* − θ0∥T ≤ ∥θn − θ0∥Tn + oP(1).

(iii) If there exists some α > 0 for which sup{∣θ0(t) − θ0(s)∣ : s, t ∈ T, ∥t − s∥ ≤ δ} = O(δα) as δ → 0, then

∥θn* − θ0∥T ≤ ∥θn − θ0∥Tn + OP(ωnα).

(iv) If θ0(t) ∈ [ℓn(t), un(t)] for all t ∈ Tn, then θ0(t) ∈ [ℓn*(t), un*(t)] for all t ∈ Tn.

(v) It holds that ∥un* − ℓn*∥Tn ≤ ∥un − ℓn∥Tn and

Σ_{t∈Tn} [un*(t) − ℓn*(t)] = Σ_{t∈Tn} [un(t) − ℓn(t)].

Theorem 1 is proved in Appendix A.1. We remark briefly on the implications of Theorem 1. Part (i) says that the estimation error of θn* over the grid Tn is never worse than that of θn, whereas parts (ii) and (iii) provide bounds on the estimation error of θn* on all of T in supremum norm. In particular, part (ii) indicates that θn* is uniformly consistent on T as long as θn is uniformly consistent over Tn, θ0 is continuous on T, and ωn = oP(1). Part (iii) provides an upper bound on the uniform rate of convergence of θn* − θ0, and indicates that if θ0 is known to lie in a Hölder class, then ωn can be chosen in such a way as to guarantee that the estimation error of θn* on all of T is asymptotically no worse than the estimation error of θn on Tn in supremum norm. We note that parts (i)–(iii) also hold for the Lp norm with respect to the uniform measure on T for any p ∈ [1, ∞). Part (iv) guarantees that the isotonized band [ℓn*, un*] never has worse coverage than the original band over Tn. Finally, part (v) states that the potential increase in coverage comes at no cost to the average or supremum width of the bands over Tn. We note that parts (i), (iv) and (v) hold true for each n.

While comprehensive in scope, Theorem 1 does not rule out the possibility that θn* performs strictly better, even asymptotically, than θn, or that the band [ℓn*, un*] is asymptotically strictly more conservative than [ℓn, un]. In order to construct confidence intervals or bands with correct asymptotic coverage, a stronger result is needed: it must be that ∥θn* − θn∥T = oP(rn−1), where rn is a diverging sequence such that rn∥θn − θ0∥T converges in distribution to a non-degenerate limit distribution. Then, we would have that rn∥θn* − θ0∥T converges in distribution to this same limit, and hence confidence bands constructed using approximations of this limit distribution would have correct coverage when centered around θn*, as we discuss more below.

We consider the following conditions on θ0 and the initial estimator θn:

(A) there exists a deterministic sequence rn tending to infinity such that, for all δ > 0,

sup_{∥t−s∥<δ/rn} ∣ rn[θn(t) − θ0(t)] − rn[θn(s) − θ0(s)] ∣ = oP(1);

(B) there exists K1 < ∞ such that ∣θ0(t) − θ0(s)∣ ≤ K1∥t − s∥ for all t, s ∈ T;

(C) there exists K0 > 0 such that K0∥t − s∥ ≤ ∣θ0(t) − θ0(s)∣ for all t, s ∈ T.

Based on these conditions, we have the following result.

Theorem 2. If (A)–(C) hold and ωn = oP(rn−1), then ∥θn* − θn∥T = oP(rn−1).

The proof of Theorem 2 is presented in Appendix A.2. This result indicates that the projected estimator is uniformly asymptotically equivalent to the original estimator in supremum norm at the rate rn.

Condition (A) is related to, but notably weaker than, uniform stochastic equicontinuity (van der Vaart and Wellner, 1996, p. 37). Condition (A) follows if, in particular, the process {rn[θn(t) − θ0(t)] : t ∈ T} converges weakly to a tight limit in the space ℓ∞(T). However, the latter condition is sufficient but not necessary for (A) to hold. This is important for the application of our results to kernel smoothing estimators, which typically do not converge weakly to a tight limit, but for which condition (A) nevertheless often holds. We discuss this at length in Section 4.2. The results of Daouia and Park (2013) (see in particular condition (C3) therein) and Chernozhukov, Fernández-Val and Galichon (2010) rely on uniform stochastic equicontinuity in demonstrating asymptotic equivalence of their correction procedures, which essentially limits the applicability of their procedures to estimators that converge weakly to a tight limit in ℓ∞(T).

Condition (B) constrains θ0 to be Lipschitz. Condition (C) constrains the variation of θ0 from below, and is slightly more restrictive than a requirement of strict monotonicity. If, for instance, θ0 is differentiable, then (C) is satisfied if all first-order partial derivatives of θ0 are bounded away from zero. Condition (C) excludes, for instance, situations in which θ0 is differentiable with null derivative over an interval. In such cases, θn* may have strictly smaller variance on these intervals than θn, because θn* will pool estimates across the flat region while θn may not. Hence, in such cases, θn* may potentially asymptotically improve on θn, so that θn* and θn are not asymptotically equivalent at the rate rn. Theoretical results in these cases would be of interest, but are beyond the scope of this article.

In addition to conditions (A)–(C), Theorem 2 requires that the mesh ωn of Tn tend to zero in probability faster than rn−1. Since Tn is chosen by the user, as long as rn (or an upper bound thereof) is known, this is not a problem in practice. Furthermore, except in irregular problems, the rate of convergence is typically not faster than n−1/2, and hence it is typically sufficient to set ωn = cnn−1/2 for some cn = o(1). We note, however, that the computational complexity of obtaining the isotonic regression of θn over Tn increases as ωn decreases. Hence, in cases where the rate of convergence of the initial estimator is strictly slower than n−1/2, it may be preferable to choose ωn more carefully based on a precise determination of rn. We expect this to be especially true in the context of large d and n.
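For instance, a grid on [0, 1] with mesh ωn = cnn−1/2 can be constructed as in the brief sketch below, where the choice cn = 1/log n is ours and purely illustrative:

```r
# Equally spaced grid on [0, 1] with mesh c_n * n^(-1/2), here c_n = 1/log(n).
make_grid <- function(n) {
  omega_n <- n^(-1/2) / log(n)         # target mesh
  m       <- ceiling(1 / omega_n) + 1  # number of grid points needed
  seq(0, 1, length.out = m)
}
length(make_grid(1000))  # grid size grows slightly faster than sqrt(n)
```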

We note that conditions (A)–(C) and ωn = oP(rn−1) also imply that ∥θn* − θn∥Lp(T) = oP(rn−1) for any p ∈ [1, ∞), where ∥·∥Lp(T) is the Lp norm with respect to the uniform measure on T. However, it may be possible to relax conditions (A)–(C) for the purpose of demonstrating Lp asymptotic equivalence of θn* and θn for p < ∞. It is not clear whether our method of proof of Theorem 2 is amenable to such weakening. We have chosen to focus on uniform asymptotic equivalence in part for its use in constructing uniform confidence bands for θ0, as we discuss in the next section.

2.3. Construction of confidence bands

Suppose there exists a fixed function γα : T → R such that ℓn and un satisfy:

(a) ∥rn(θn − ℓn) − γα∥T →P 0;

(b) ∥rn(un − θn) − γα∥T →P 0;

(c) P0[rn∣θn(t) − θ0(t)∣ ≤ γα(t) for all t ∈ T] → 1 − α.

As an example of a confidence band that satisfies conditions (a)–(c), suppose that σ0 : T → (0, ∞) is a scaling function and cα is a fixed constant such that, as n tends to infinity,

P0(rn∥(θn − θ0)/σ0∥T ≤ cα) → 1 − α.

If σn is an estimator of σ0 satisfying ∥σn − σ0∥T →P 0 and cα,n is an estimator of cα such that cα,n →P cα, then the Wald-type band defined by the lower and upper endpoints ℓn(t) := θn(t) − cα,nrn−1σn(t) and un(t) := θn(t) + cα,nrn−1σn(t) satisfies (a)–(c) with γα = cασ0. However, the latter conditions can also be satisfied by other types of bands, such as those constructed with a consistent bootstrap procedure.
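In the root-n setting rn = n1/2, for instance, such a Wald-type band and its isotonic correction might be computed as in the sketch below; theta_hat, sigma_hat, and c_alpha_n are assumed to be supplied by a preceding analysis.

```r
# Wald-type band on the grid, followed by isotonic correction of both
# endpoints; theta_hat, sigma_hat, and c_alpha_n are assumed given.
wald_band <- function(theta_hat, sigma_hat, c_alpha_n, n) {
  r_n <- sqrt(n)
  list(lower = theta_hat - c_alpha_n * sigma_hat / r_n,
       upper = theta_hat + c_alpha_n * sigma_hat / r_n)
}
correct_band <- function(band, t_grid) {
  list(lower = isoreg(t_grid, band$lower)$yf,
       upper = isoreg(t_grid, band$upper)$yf)
}
```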

Under conditions (a)–(c), the confidence band [ℓn, un] has asymptotic coverage 1 − α. When conditions (A)–(C) also hold, the corrected band [ℓn*, un*] has the same asymptotic coverage as the original band [ℓn, un], as stated in the following result.

Corollary 1. If (A)–(C) and (a)–(c) hold, γα is uniformly continuous on T, and ωn = oP(rn−1), then the band [ℓn*, un*] has asymptotic coverage 1 − α.

The proof of Corollary 1 is presented in Appendix A.3. We also note that Theorem 2 implies that Wald-type confidence bands constructed around θn have the same asymptotic coverage if they are instead constructed around θn*.

3. Refined results under additional structure

In this section, we provide more detailed conditions that imply condition (A) in two special cases: when θn is asymptotically linear, and when θn is a kernel smoothing-type estimator.

3.1. Special case I: asymptotically linear estimators

Suppose that the initial estimator θn is uniformly asymptotically linear (UAL): for each t ∈ T, there exists ϕ0,t : X → R depending on P0 such that ʃ ϕ0,t dP0 = 0, ʃ ϕ0,t2 dP0 < ∞ and

θn(t) = θ0(t) + n−1 Σi=1n ϕ0,t(Xi) + Rn,t (3.1)

for a remainder term Rn,t with n1/2 sup_{t∈T} ∣Rn,t∣ = oP(1). The function ϕ0,t is the influence function of θn(t) under sampling from P0. It is desirable for θn to have representation (3.1) because this implies its uniform weak consistency as well as the pointwise asymptotic normality of n1/2[θn(t) − θ0(t)] for each t ∈ T. If in addition the collection {ϕ0,t : t ∈ T} of influence functions forms a P0-Donsker class, then {n1/2[θn(t) − θ0(t)] : t ∈ T} converges weakly in ℓ∞(T) to a Gaussian process with covariance function Σ0 : (t, s) ↦ ʃ ϕ0,t(x)ϕ0,s(x) dP0(x). Uniform asymptotic confidence bands based on θn can then be formed by using appropriate quantiles from any suitable approximation of the distribution of the supremum of the limiting Gaussian process.
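One such approximation, not prescribed by this article but commonly used with asymptotically linear estimators, is a multiplier bootstrap based on estimated influence function values; the sketch below assumes an n × m matrix phi_hat of estimated influence function values over a grid of m points.

```r
# Multiplier-bootstrap approximation of the (1 - alpha) quantile of
# sup_t |G(t)|, where G is the Gaussian limit of sqrt(n) * (theta_n - theta_0).
# phi_hat: n x m matrix of estimated influence function values on the grid.
sup_quantile <- function(phi_hat, alpha = 0.05, B = 2000) {
  n <- nrow(phi_hat)
  sups <- replicate(B, {
    xi <- rnorm(n)                             # standard normal multipliers
    max(abs(colSums(xi * phi_hat) / sqrt(n)))  # sup of the multiplier process
  })
  unname(quantile(sups, 1 - alpha))
}
```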

We introduce two additional conditions:

(A1) the collection {ϕ0,t : t ∈ T} of influence functions is a P0-Donsker class;

(A2) Σ0 is uniformly continuous in the sense that

lim_{∥t−s∥→0} ∣Σ0(s, t) − Σ0(t, t)∣ = 0.

Whenever θn is uniformly asymptotically linear, condition (A) can be verified via conditions (A1) and (A2), as implied by the theorem below. The validity of (A1) and (A2) can be assessed by scrutinizing the influence function ϕ0,t of θn(t) for each t ∈ T. This fact renders the verification of these conditions very simple once uniform asymptotic linearity has been established.

Theorem 3. For any UAL estimator θn, (A1)–(A2) together imply (A).

The proof of Theorem 3 is provided in Appendix A.4. In Section 4.1, we illustrate the use of Theorem 3 for the estimation of a G-computed distribution function.

We note that conditions (A1)–(A2) are actually sufficient to establish uniform asymptotic equicontinuity, which as discussed above is stronger than (A). Therefore, Theorem 3 can also be used to prove asymptotic equivalence of the majorization/minorization correction procedure studied in Daouia and Park (2013).

3.2. Special case II: kernel smoothed estimators

For certain parameters, asymptotically linear estimators are not available. In particular, this is the case when the parameter of interest is not sufficiently smooth as a mapping of P0. For example, density functions, regression functions, and conditional quantile functions do not permit asymptotically linear estimators in a nonparametric model when the exposure is continuous. In these settings, a common approach to nonparametric estimation is kernel smoothing.

Recent results suggest that, as a process, the only possible weak limit of {rn[θn(t) − θ0(t)] : t ∈ T} in ℓ∞(T) is zero when θn is a kernel smoothed estimator. For example, in the case of the Parzen-Rosenblatt density estimator with bandwidth hn, Theorem 3 of Stupfler (2016) implies that if

cn := rn(nhn/∣log hn∣)−1/2 → 0,

then {rn[θn(t) − θ0(t)] : t ∈ T} converges weakly to zero in ℓ∞(T), whereas if cn → c ∈ (0, ∞], then it does not converge weakly to a tight limit in ℓ∞(T). As a result, {rn[θn(t) − θ0(t)] : t ∈ T} only satisfies uniform stochastic equicontinuity for rn such that cn → 0. However, any such rn diverges more slowly than the pointwise and uniform rates of convergence of θn − θ0. As a result, θn* and θn may not be asymptotically equivalent at the uniform rate of convergence of θn − θ0, so that confidence intervals and regions based on the limit distribution of θn − θ0, but centered around θn*, may not have correct coverage. We note that, while Stupfler (2016) establishes formal results for the Parzen-Rosenblatt estimator, we expect that the results therein extend to a variety of kernel smoothed estimators.

As a result of the lack of uniform stochastic equicontinuity of rn(θn − θ0) for useful rates rn, establishing (A) is much more difficult for kernel smoothed estimators than for asymptotically linear estimators. However, since (A) is weaker than uniform stochastic equicontinuity, it may still be possible. Here, we provide alternative sufficient conditions that imply condition (A) and that we have found useful for studying a kernel smoothed estimator θn.

When the initial estimator θn is kernel smoothed, we can often show that

sup_{t∈T} ∣ rn[θn(t) − θ0(t)] − anb0(t) − Rn(t) ∣ →P 0, (3.2)

where b0 : T → R is a deterministic bias function, an is a sequence of positive constants, and Rn : T → R is a random remainder term. We then have that

sup_{∥t−s∥<δ/rn} ∣ rn[θn(t) − θ0(t)] − rn[θn(s) − θ0(s)] ∣ ≤ sup_{∥t−s∥<δ/rn} an∣b0(t) − b0(s)∣ + sup_{∥t−s∥<δ/rn} ∣Rn(t) − Rn(s)∣ + oP(1).

If b0 is uniformly continuous on T and an = O(1), or b0 is uniformly α-Hölder on T and an = O(rnα), then the first term on the right-hand side tends to zero in probability. Attention may then be turned to demonstrating that the second term vanishes in probability. It appears difficult to provide a general characterization of the form of Rn that encompasses kernel smoothed estimators. However, in our experience, it is frequently the case that Rn(t) involves terms of the form Gnνn,t, where νn,t : X → R is a deterministic function for each n ∈ {1, 2, …} and t ∈ T. In the course of demonstrating that

sup_{∥t−s∥<δ/rn} ∣Rn(t) − Rn(s)∣ →P 0,

a rate of convergence for

sup_{∥t−s∥<δ/rn} ∣Gn(νn,t − νn,s)∣

is then required. Defining ℱn,η := {νn,t − νn,s : ∥t − s∥ < η} for each η > 0, this is equivalent to establishing a rate of convergence for the local empirical process ∥Gn∥ℱn,δ/rn := sup_{ξ∈ℱn,δ/rn} ∣Gnξ∣. Such rates can be established using tail bounds for empirical processes. We briefly comment on two approaches to obtaining such tail bounds.

We first define bracketing and covering numbers of a class of functions ℱ — see van der Vaart and Wellner (1996) for a comprehensive treatment. We denote by ∥F∥P,2 := [P(F2)]1/2 the L2(P) norm of a given P-square-integrable function F : X → R. The bracketing number N[](ε, ℱ, L2(P)) of a class of functions ℱ with respect to the L2(P) norm is the smallest number of ε-brackets needed to cover ℱ, where an ε-bracket is any set of functions {f : ℓ ≤ f ≤ u} for functions ℓ and u such that ∥u − ℓ∥P,2 < ε. The covering number N(ε, ℱ, L2(Q)) of ℱ with respect to the L2(Q) norm is the smallest number of ε-balls in L2(Q) required to cover ℱ. The uniform covering number is the supremum of N(ε∥F∥Q,2, ℱ, L2(Q)) over all discrete probability measures Q such that ∥F∥Q,2 > 0, where F is an envelope function for ℱ. The bracketing and uniform entropy integrals for ℱ with respect to F are then defined as

J[](δ, ℱ) := ʃ0δ [1 + log N[](ε∥F∥P0,2, ℱ, L2(P0))]1/2 dε,
J(δ, ℱ) := supQ ʃ0δ [1 + log N(ε∥F∥Q,2, ℱ, L2(Q))]1/2 dε.

We discuss two approaches to controlling ∥Gn∥ℱn,δ/rn using these integrals. Suppose that ℱn,η has envelope function Fn,η, in the sense that ∣ξ(x)∣ ≤ Fn,η(x) for all ξ ∈ ℱn,η and x ∈ X. The first approach is useful when ∥Fn,δ/rn∥P0,2 can be adequately controlled. Specifically, if either J(1, ℱn,δ/rn) or J[](1, ℱn,δ/rn) is O(1), then E∥Gn∥ℱn,δ/rn ≤ Mδ∥Fn,δ/rn∥P0,2 for all n and some constant Mδ ∈ (0, ∞) not depending on n, by Theorems 2.14.1 and 2.14.2 of van der Vaart and Wellner (1996).

The second approach we consider is useful when the envelope functions do not shrink in expectation, but the functions in ℱn,η still get smaller in the sense that γn,δ := sup_{ξ∈ℱn,δ/rn} ∥ξ∥P0,2 tends to zero. For example, if νn,t is defined as νn,t(x) := I(0 ≤ x ≤ t) for each x ∈ X ⊆ R, t ∈ [0, 1], and n, then Fn,η : x ↦ I(0 ≤ x ≤ 1) is the natural envelope function for ℱn,η for all n and η, so that ∥Fn,δ/rn∥P0,2 does not tend to zero. However, if the density p0 corresponding to P0 is bounded above by p̄0, then γn,δ2 ≤ p̄0δ/rn, which does tend to zero. In these cases, the basic tail bounds in Theorems 2.14.1 and 2.14.2 of van der Vaart and Wellner (1996) are too weak. Sharper, but slightly more complicated, bounds may be used instead. Specifically, if Fn,δ/rn ≤ C < ∞ for all n large enough and either

J(γn,δ, ℱn,δ/rn) + J(γn,δ, ℱn,δ/rn)2γn,δ−2n−1/2  or  J[](γn,δ, ℱn,δ/rn) + J[](γn,δ, ℱn,δ/rn)2γn,δ−2n−1/2

are o(zn−1), then ∥Gn∥ℱn,δ/rn = oP(zn−1), by Lemma 3.4.2 of van der Vaart and Wellner (1996) and Theorem 2.1 of van der Vaart and Wellner (2011). Analogous statements hold if these expressions are O(zn−1).

In some cases, both of these approaches must be used to control different terms arising within Rn(t), as for the conditional distribution function discussed in Section 4.2.

4. Illustrative examples

4.1. Example 1: Estimation of a G-computed distribution function

We first demonstrate the use of Theorem 3 in the particular problem in which we wish to draw inference on a G-computed distribution function. Suppose that the data unit is the vector X = (Y, A, W), where Y is an outcome, A ∈ {0, 1} is an exposure, and W is a vector of baseline covariates. The observed data consist of independent draws X1, X2, …, Xn from P0 ∈ M, where M is a nonparametric model.

For P ∈ M and a0 ∈ {0, 1}, we define the parameter value θP,a0 pointwise as θP,a0(t) := EP[P(Y ≤ t ∣ A = a0, W)], the G-computed distribution function of Y evaluated at t, where the outer expectation is over the marginal distribution of W under P. We are interested in estimating θ0,a0 := θP0,a0. This parameter is often of interest as an interpretable marginal summary of the relationship between Y and A accounting for the potential confounding induced by W. Under certain causal identification conditions, θ0,a0 is the distribution function of the counterfactual outcome Y(a0) defined by the intervention that deterministically sets the exposure to A = a0 (Robins, 1986; Gill and Robins, 2001).

For each t, the parameter P ↦ θP,a0(t) is pathwise differentiable in a nonparametric model, and its nonparametric efficient influence function φP,a0,t at P ∈ M is given by

(y, a, w) ↦ [I(a = a0)/gP(a0 ∣ w)][I(y ≤ t) − Q̄P(t ∣ a0, w)] + Q̄P(t ∣ a0, w) − θP,a0(t),

where gP(a0 ∣ w) := P(A = a0 ∣ W = w) is the propensity score and Q̄P(t ∣ a0, w) := P(Y ≤ t ∣ A = a0, W = w) is the conditional exposure-specific distribution function implied by P (van der Laan and Robins, 2003). Given estimators gn and Q̄n of g0 := gP0 and Q̄0 := Q̄P0, respectively, several approaches can be used to construct, for each t, an asymptotically linear estimator of θ0,a0(t) with influence function ϕ0,a0,t := φP0,a0,t. For example, the use of either optimal estimating equations or the one-step correction procedure leads to the doubly-robust augmented inverse-probability-weighted estimator

θn,a0(t) := n−1 Σi=1n { [I(Ai = a0)/gn(a0 ∣ Wi)][I(Yi ≤ t) − Q̄n(t ∣ a0, Wi)] + Q̄n(t ∣ a0, Wi) },

as discussed in detail in van der Laan and Robins (2003). Under conditions on gn and Q̄n, including consistency at fast enough rates, θn,a0(t) is asymptotically efficient relative to M. In this case, θn,a0(t) satisfies (3.1) with influence function ϕ0,a0,t. However, there is no guarantee that θn,a0 is monotone.
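As a concrete illustration, the sketch below computes this estimator over a grid of t values; the logistic working models for the two nuisance functions are our own illustrative choices, not prescriptions of the article.

```r
# Hedged sketch of the AIPW estimator of the G-computed distribution
# function; logistic working models for the nuisances are illustrative.
aipw_gcdf <- function(Y, A, W, a0, t_grid) {
  W <- as.data.frame(W)
  # Propensity score g_n(a0 | W).
  g_fit <- glm(A ~ ., data = W, family = binomial)
  g_hat <- predict(g_fit, type = "response")  # estimate of P(A = 1 | W)
  if (a0 == 0) g_hat <- 1 - g_hat
  sapply(t_grid, function(t) {
    # Outcome model Qbar_n(t | a0, W), fit among units with A == a0.
    Q_fit <- glm(I(Y <= t) ~ ., data = W, family = binomial,
                 subset = (A == a0))
    Q_hat <- predict(Q_fit, newdata = W, type = "response")
    mean((A == a0) / g_hat * ((Y <= t) - Q_hat) + Q_hat)
  })
}
```

The resulting curve over t_grid need not be monotone, and can be corrected with isoreg exactly as in Section 2.1.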

In the context of this example, we can identify simple sufficient conditions under which conditions (A)–(C), and hence the asymptotic equivalence of the initial and isotonized estimators of the G-computed distribution function, are guaranteed. Specifically, we find this to be the case when both:

  1. there exists η > 0 such that g0(a0 ∣ W) ≥ η almost surely under P0;

  2. there exist non-negative real-valued functions K1, K2 such that

K1(w)∣t − s∣ ≤ ∣Q̄0(t ∣ a0, w) − Q̄0(s ∣ a0, w)∣ ≤ K2(w)∣t − s∣

for all t, s ∈ T, and such that, under P0, K1(W) is strictly positive with non-zero probability and K2(W) has finite second moment.

We conducted a simulation study to validate our theoretical results in the context of this particular example. For sample sizes 100, 250, 500, 750, and 1000, we generated 1000 random datasets as follows. We first simulated a bivariate covariate W with independent components W1 and W2, respectively distributed as a Bernoulli variate with success probability 0.5 and a uniform variate on (−1, 1). Given W = (w1, w2), the exposure A was simulated from a logistic regression model with

P0(A = 1 ∣ W1 = w1, W2 = w2) = expit(0.5 + w1 − 2w2).

Given W = (w1, w2) and A = a, Y was simulated as the inverse-logistic transformation of a normal variate with mean 0.2 − 0.3a − 4w2 and variance 0.3.

For each simulated dataset, we estimated θ0,0(t) and θ0,1(t) for t equal to each outcome value observed between 0.1 and 0.9. To do so, we used the estimator described above, with the propensity score and conditional exposure-specific distribution function estimated using correctly-specified parametric models. We employed two correction procedures for the estimators θn,0 and θn,1. First, we projected θn,0 and θn,1 onto the space of monotone functions separately. Second, noting that θ0,0(t) ≤ θ0,1(t) for all t, so that (a, t) ↦ θ0,a(t) is component-wise monotone for this particular data-generating distribution, we considered the projection of (a, t) ↦ θn,a(t) onto the space of bivariate monotone functions on {0, 1} × T. For each simulation and each projection procedure, we recorded the maximal absolute differences between (i) the initial and projected estimates, (ii) the initial estimate and the truth, and (iii) the projected estimate and the truth. We also recorded the maximal widths of the initial and projected confidence bands.
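A minimal sketch of the second (bivariate) correction: stack the two curves as the rows of a matrix and apply bivariate isotonic regression, here via Iso::biviso; theta_hat_0 and theta_hat_1 denote the initial estimates over the grid of t values.

```r
# Bivariate projection of (a, t) |-> theta_{n,a}(t) over {0, 1} x T_n.
# theta_hat_0 and theta_hat_1 are initial estimates on t_grid (may cross).
library(Iso)
theta_mat <- rbind(theta_hat_0, theta_hat_1)  # rows indexed by a = 0, 1
theta_cor <- biviso(theta_mat)                # monotone in both a and t
theta_cor_0 <- theta_cor[1, ]
theta_cor_1 <- theta_cor[2, ]
```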

Figure 1 displays the results of this simulation study, with output from the univariate and bivariate projection approaches summarized in the top and bottom rows, respectively. The left column displays the empirical distribution of the scaled maximum absolute discrepancy between θn and θn* for all sample sizes studied. This plot confirms that the discrepancy between these two estimators indeed decreases faster than n−1/2, as our theory suggests. Furthermore, for each n, the discrepancy is larger for the two-dimensional projection.

Fig 1. Summary of simulation results for the G-computed distribution function. Each plot shows cumulative distributions of a particular discrepancy over 1000 simulated datasets for different values of n. Left panel: maximal absolute difference between the initial and isotonic estimators over the grid used for projecting, scaled up by root-n. Middle panel: ratio of the maximal absolute difference between the initial estimator and the truth to the maximal absolute difference between the isotonic estimator and the truth. Right panel: ratio of the maximal width of the initial confidence band to the maximal width of the isotonic confidence band. The top row shows the results for the univariate projection, and the bottom row shows the results for the bivariate projection.

The middle column of Figure 1 displays the empirical distribution function of the ratio between the maximum discrepancy of θn from θ0 and that of θn* from θ0. This plot confirms that θn* is always at least as close to θ0 as θn is, over Tn. The maximum discrepancy between θn and θ0 can be more than 25% larger than that between θn* and θ0 in the univariate case, and up to 50% larger in the bivariate case.

The right column of Figure 1 displays the empirical distribution function of the ratio between the maximum size of the initial uniform 95% influence function-based confidence band and that of the isotonic band. For large samples, the maximal widths are often close, but for smaller samples, the initial confidence bands can be up to 50% larger than the isotonic bands, especially for the bivariate case. The empirical coverage of both bands is provided in Table 1. The coverage of the isotonic band is essentially the same as the initial band for the univariate case, whereas it is slightly larger than that of the initial band in the bivariate case.

Table 1.

Coverage of 95% confidence bands for the true counterfactual distribution function.

n 100 250 500 750 1000
d=1 Initial band 92.5 94.1 96.0 94.5 95.5
Monotone band 92.5 94.1 96.0 94.5 95.5
d=2 Initial band 93.9 94.0 95.0 94.6 94.9
Monotone band 95.7 95.9 95.5 95.3 95.1

4.2. Example 2: Estimation of a conditional distribution function

We next demonstrate the use of Theorem 2 with dimension d = 2 for drawing inference on a conditional distribution function. Suppose that the data unit is the vector X = (A, Y), where Y is an outcome and A is now a continuous exposure. The observed data consist of independent draws (A1, Y1), (A2, Y2), …, (An, Yn) from P0 ∈ M, where M is a nonparametric model. We define the parameter value θP pointwise as θP(t1, t2) := P(Y ≤ t1 ∣ A = t2). Thus, θP is the conditional distribution function of Y at t1 given A = t2. The map (t1, t2) ↦ θP(t1, t2) is necessarily monotone in t1 for each fixed t2, and in some settings, it may be known that it is also monotone in t2 for each fixed t1. This parameter completely describes the conditional distribution of Y given A, and can be used to obtain the conditional mean, conditional quantiles, or any other conditional parameter of interest.

For each t1, the true function θ0(t1, t2) := θP0(t1, t2) may be written as the conditional mean of I(Y ≤ t1) given A = t2. Hence, any method of nonparametric regression can be used to estimate t2 ↦ θ0(t1, t2) for fixed t1, and repeating such a method over a grid of values of t1 yields an estimator of the entire function. We expect that our results would apply to many of these methods. Here, we consider the local linear estimator (Fan and Gijbels, 1996), which may be expressed as

θn(t1, t2) := (nhn)−1 Σi=1n I(Yi ≤ t1) [ (s2,n(t2) − s1,n(t2)(Ai − t2)) / (s0,n(t2)s2,n(t2) − s1,n(t2)2) ] K((Ai − t2)/hn),

where K : R → R is a symmetric and bounded kernel function, hn → 0 is a sequence of bandwidths, and

sj,n(t2) := (nhn)−1 Σi=1n (Ai − t2)j K((Ai − t2)/hn)

for j ∈ {0, 1, 2}. Under regularity conditions on the true distribution function θ0, the marginal density f0 of A, the bandwidth sequence hn, and the kernel function K, for any fixed (t1, t2), θn satisfies

(nhn)1/2 [θn(t1, t2) − θ0(t1, t2) − hn2VKb0(t1, t2)] →d N(0, SKv0(t1, t2)),

where VK := ʃ x2K(x) dx is the variance of K, SK := ʃ K(x)2 dx, and b0(t1, t2) and v0(t1, t2) depend on the derivatives of θ0 and on f0. If hn is chosen to be of order n−1/5, the rate that minimizes the asymptotic mean integrated squared error of θn relative to θ0, then n2/5[θn(t1, t2) − θ0(t1, t2)] converges in law to a normal random variate with mean VKb0(t1, t2) and variance SKv0(t1, t2). Under stronger regularity conditions, the uniform norm ∥θn − θ0∥T can be shown to converge at the rate (nhn/log n)−1/2 (Härdle, Janssen and Serfling, 1988).

Theorem 3 cannot be used to establish (A) in this problem, since θn is not an asymptotically linear estimator. Furthermore, as discussed above, recent results suggest that {rn[θn(t) − θ0(t)] : t ∈ T} does not converge weakly to a tight limit in ℓ∞(T) for any useful rate rn. Despite this lack of weak convergence, condition (A) can be verified directly in the context of this example under smoothness conditions on θ0 and f0, using the tail bounds for empirical processes outlined in Section 3.2. Denoting by θ0,t2′ and θ0,t2″ the first and second derivatives of θ0 with respect to its second argument, we define

Rθ(2)(t, δ) := θ0(t1, t2 + δ) − θ0(t1, t2) − δθ0,t2′(t1, t2) − (1/2)δ2θ0,t2″(t1, t2)

and Rf(1)(t, δ) := f0(t2 + δ) − f0(t2) − δf0′(t2), where f0′ is the derivative of f0. We then introduce the following conditions on θ0, f0, and K:

(d) θ0,t2″ exists and is continuous on T, and, as δ → 0, sup_{t∈T} ∣Rθ(2)(t, δ)∣ = o(δ2);

(e) inf_{t∈T} f0(t2) > 0, f0′ exists and is continuous on T, and, as δ → 0, sup_{t∈T} ∣Rf(1)(t, δ)∣ = o(δ);

(f) K is a Lipschitz function supported on [−1, 1] satisfying condition (M) of Stupfler (2016).

We also define

νn,t(y, a) := [I(y ≤ t1) − θ0(t1, a)] K((a − t2)/hn);
gn(t2) := s0,n(t2)s2,n(t2) − s1,n(t2)2;
Rn(t) := hn−1/2 [ (s2,n(t2)/gn(t2)) Gnνn,t − (s1,n(t2)/gn(t2)) Gn(ιt2νn,t) ],
where ιt2(y, a) := a − t2.

We then have the following result.

Proposition 1. If (d)–(f) hold, nhn4/∣log hn∣ → ∞ and nhn5 = O(1), then

sup_{t∈T} ∣ (nhn)1/2[θn(t1, t2) − θ0(t1, t2)] − (nhn5)1/2(1/2)θ0,t2″(t1, t2)K2 − Rn(t) ∣ →P 0.

Proposition 1 aids in establishing the following result, which formally establishes asymptotic equivalence of the local linear estimator of a conditional distribution function and its correction obtained via isotonic regression at the rate rn = (nhn)1/2.

Proposition 2. If (d)–(f) hold and nhn5 → c ∈ (0, ∞), then (A) holds for the local linear estimator with rn = (nhn)1/2.

The proofs of Propositions 1 and 2 are provided in Appendix A.5. These results may also be of interest in their own right for establishing other properties of the local linear estimator.

As with the first example, we conducted a simulation study to validate our theoretical results. For sample sizes n ∈ {100, 250, 500, 750, 1000}, we generated 1000 random datasets as follows. We first simulated A as a Beta(2, 3) variate. Given A = a, Y was simulated as the inverse-logistic transformation of a normal variate with mean 0.5 × [1 + (a − 1.2)2] and variance one.

For each simulated dataset, we estimated θ0(y, a) for each (y, a) in an equally spaced square grid of mesh ωn = n−4/5. For each unique y in this grid, we estimated the function a ↦ θ0(y, a) using the local linear estimator, as implemented in the R package KernSmooth (Wand, 2015; Wand and Jones, 1995). For each value of y in the grid, we computed the optimal bandwidth based on the direct plug-in methodology of Ruppert, Sheather and Wand (1995), as implemented by the dpill function, and we then set our bandwidth as the average of these y-specific bandwidths. We constructed initial confidence bands using a variable-width nonparametric bootstrap (Hall and Kang, 2001).
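The sketch below mirrors that implementation for a single grid value of y; the averaging of the y-specific dpill bandwidths and the bootstrap bands are omitted for brevity.

```r
# Local linear estimate of a |-> P(Y <= y | A = a) for one grid value y,
# using KernSmooth; the bandwidth is chosen by the direct plug-in rule.
library(KernSmooth)
est_cond_cdf <- function(A, Y, y, a_grid, bandwidth) {
  Z   <- as.numeric(Y <= y)            # binary pseudo-outcome
  fit <- locpoly(A, Z, degree = 1, bandwidth = bandwidth,
                 gridsize = length(a_grid), range.x = range(a_grid))
  fit$y                                # may violate monotonicity in a
}
# Bandwidth for one y (the article averages these over the y grid):
# h_y <- dpill(A, as.numeric(Y <= y))
```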

We first note that, for all sample sizes considered, over 99% of simulations had monotonicity violations in both the y- and a-directions. Figure 2 displays the results of this simulation study. The left exhibit of Figure 2 confirms that the discrepancy between θn and θn* decreases faster than rn−1 = n−2/5, as our theory suggests. The middle exhibit indicates that in roughly 50% of simulations, there is less than a 5% difference between ∥θn − θ0∥Tn and ∥θn* − θ0∥Tn, but even for n = 1000, in roughly 25% of simulations, θn* offers at least a 25% improvement in estimation error. In smaller samples, the estimation error of θn* is less than half that of θn in 5–10% of simulations. The rightmost exhibit indicates that the projected confidence bands regularly reduce the uniform size of the initial bands by 10–20%. Finally, the empirical coverage of uniform 95% bootstrap-based bands and their projected versions is provided in Table 2. As before, the projected band is always more conservative than the initial band, and the difference in coverage diminishes as n grows. However, the initial bands in this example are anti-conservative, even at n = 1000, likely due to the slower rate of convergence, and the corrected bands offer a much more substantial improvement in this example than in the first.

Fig 2. Summary of simulation results for the conditional distribution function. The three columns display the same results as those in Figure 1.

Table 2.

Coverage of 95% confidence bands for the true conditional distribution function.

n 100 250 500 750 1000
Initial band 37.6 64.9 83.2 86.3 89.7
Monotone band 60.8 80.4 90.3 92.3 93.9

5. Discussion

Many estimators of function-valued parameters in nonparametric and semiparametric models are not guaranteed to respect shape constraints on the true function. A simple and general solution to this problem is to project the initial estimator onto the constrained parameter space over a grid whose mesh goes to zero fast enough with sample size. However, this introduces the possibility that the projected estimator has different properties than the original estimator. In this paper, we studied the important shape constraint of multivariate component-wise monotonicity. We provided results indicating that the projected estimator is generically no worse than the initial estimator, and that if the true function is strictly increasing and the initial estimator possesses a relatively weak type of stochastic equicontinuity, the projected estimator is uniformly asymptotically equivalent to the initial estimator. We provided especially simple sufficient conditions for this latter result when the initial estimator is uniformly asymptotically linear, and provided guidance on establishing the key condition for kernel smoothed estimators.

We studied the application of our results in two examples: estimation of a G-computed distribution function, for use in understanding the effect of a binary exposure on an outcome when the exposure-outcome relationship is confounded by recorded covariates, and of a conditional distribution function, for use in characterizing the marginal dependence of an outcome on a continuous exposure. In numerical studies, we found that the projected estimator yielded improvements over the initial estimator. The improvements were especially strong in the latter example.

In our examples, we only studied corrections in dimensions d = 1 and d = 2. In future work, it would be interesting to consider corrections in dimensions higher than 2. For example, for the conditional distribution function, it would be of interest to study multivariate local linear estimators for a continuous exposure A taking values in Rd−1 for d > 2. Since tailored algorithms for computing the isotonic regression do not yet exist for d > 2, it would also be of interest to determine whether a version of Theorem 2 could be established for the relaxed isotonic estimator proposed by Fokianos, Leucht and Neumann (2017). Alternatively, it is possible that the uniform stochastic equicontinuity currently required by Chernozhukov, Fernández-Val and Galichon (2010) and Daouia and Park (2013) for asymptotic equivalence of the rearrangement- and envelope-based corrections, respectively, could be relaxed along the lines of our condition (A). Finally, our theoretical results do not give the exact asymptotic behavior of the projected estimator or projected confidence band when the true function possesses flat regions. This is also an interesting topic for future research.

Acknowledgements

The authors gratefully acknowledge the constructive comments of the editors and anonymous reviewers as well as grant support from the National Institute of Allergy and Infectious Diseases (NIAID) and the National Heart, Lung and Blood Institute (NHLBI) of the National Institutes of Health.

Supported by NIAID grant UM1AI068635.

Supported by NIAID grant R01AI074345.

Supported by NHLBI grant R01HL137808.

Appendix A: Technical proofs

A.1. Proof of Theorem 1

Part (i) follows from Corollary B to Theorem 1.6.1 of Robertson, Wright and Dykstra (1988). For parts (ii) and (iii), we note that by assumption

∣θn*(t) − θ0(t)∣ ≤ Σk λk,n(t)∣θn*(sk) − θ0(sk)∣ + Σk λk,n(t)∣θ0(sk) − θ0(t)∣

for every t ∈ T, where Σk λk,n(t) = 1 and, for each k, sk ∈ Tn and ∥sk − t∥ ≤ 2ωn. By part (i), the first term is bounded above by sup_{s∈Tn} ∣θn(s) − θ0(s)∣. The second term is bounded above by γ(2ωn), where we define

γ(δ) := sup{∣θ0(t) − θ0(s)∣ : t, s ∈ T, ∥t − s∥ ≤ δ}.

If θ0 is continuous on T, then it is also uniformly continuous since T is compact. Therefore, γ(δ) → γ(0) = 0 as δ → 0, so that γ(2ωn) →P 0 if ωn →P 0. If γ(δ) = O(δα) as δ → 0, then γ(2ωn) = OP(ωnα).

Part (iv) follows from the proof of Proposition 3 of Chernozhukov, Fernández-Val and Galichon (2009), which applies to any order-preserving monotonization procedure. For the second statement of (v), by their definition as minimizers of the least-squares criterion function, we note that Σ_{t∈Tn} un*(t) = Σ_{t∈Tn} un(t), and similarly for ℓn*. The first statement of (v) follows from a slight modification of Theorem 1.6.1 of Robertson, Wright and Dykstra (1988). As stated, the result says that Σ_{t∈Tn} G(θ(t) − θ*(t)) ≤ Σ_{t∈Tn} G(θ(t) − ψ(t)) for any convex function G : R → R and monotone function ψ, where θ* is the isotonic regression of θ over Tn. A straightforward adaptation of the proof indicates that Σ_{t∈Tn} G(θ1*(t) − θ2*(t)) ≤ Σ_{t∈Tn} G(θ1(t) − θ2(t)), where θ1* and θ2* are the isotonic regressions of θ1 and θ2 over Tn, respectively. As in Corollary B, taking G(x) = ∣x∣p and letting p → ∞ yields that ∥θ1* − θ2*∥Tn ≤ ∥θ1 − θ2∥Tn. Applying this with θ1 = un and θ2 = ℓn establishes the first portion of (v). □

A.2. Proof of Theorem 2

We prove Theorem 2 via three lemmas, which may be of interest in their own right. The first lemma controls the size of deviations in θn over small neighborhoods, and does not hinge on condition (C) holding.

Lemma 1. If (A)–(B) hold and bn = oP(rn−1), then

sup_{∥t−s∥≤bn} ∣θn(t) − θn(s)∣ = oP(rn−1).

Proof of Lemma 1. In view of the triangle inequality,

∣θn(t) − θn(s)∣ ≤ ∣{θn(t) − θ0(t)} − {θn(s) − θ0(s)}∣ + ∣θ0(t) − θ0(s)∣.

The first term is oP(rn−1) by (A), whereas the second term is oP(rn−1) by (B). □

The second lemma controls the size of neighborhoods over which violations in monotonicity can occur. Henceforth, we define

κn := sup{∥t − s∥ : s, t ∈ T, s ≤ t, θn(t) ≤ θn(s)}.

In this lemma we again require (A) but now require (C) rather than (B).

Lemma 2. If (A) and (C) hold, then κn = oP(rn−1).

Proof of Lemma 2. Let ϵ > 0 and ηn := ϵ/rn. Suppose that κn > ηn. Then, there exist s, t ∈ T with s < t and ∥t − s∥ > ηn such that θn(s) ≥ θn(t). We claim that there must also exist s*, t* ∈ T with s* < t* and ∥t* − s*∥ ∈ [ηn/2, ηn] such that θn(s*) ≥ θn(t*). To see this, let J := ⌊∥t − s∥/(ηn/2)⌋ − 1, and note that J ≥ 1. Define tj := s + j(ηn/2)(t − s)/∥t − s∥ for j = 0, 1, …, J, and set tJ+1 := t. Thus, tj < tj+1 and ∥tj+1 − tj∥ ∈ [ηn/2, ηn] for each j = 0, 1, …, J. Since then Σj=0J [θn(tj+1) − θn(tj)] = θn(t) − θn(s) ≤ 0, it must be that θn(tj+1) ≤ θn(tj) for at least one j. This proves the claim.

We now have that κn > ηn implies that there exist s, t ∈ T with s < t and ∥t − s∥ ∈ [ηn/2, ηn] such that θn(s) ≥ θn(t). This further implies that

∣{θn(t) − θ0(t)} − {θn(s) − θ0(s)}∣ ≥ θ0(t) − θ0(s) ≥ K0∥t − s∥ ≥ K0ηn/2

by condition (C). Finally, this allows us to write

P0(κn > ϵ/rn) ≤ P0{ sup_{∥t−s∥≤ϵ/rn} rn∣[θn(t) − θ0(t)] − [θn(s) − θ0(s)]∣ ≥ K0ϵ/2 }.

By condition (A), this probability tends to zero for every ϵ > 0, which completes the proof. □

Our final lemma bounds the maximal absolute deviation between θn* and θn over the grid Tn in terms of the supremal deviations of θn over neighborhoods no larger than κn. This lemma does not depend on any of the conditions (A)–(C).

Lemma 3. It holds that max_{t∈Tn} ∣θn*(t) − θn(t)∣ ≤ sup_{∥s−t∥≤κn} ∣θn(s) − θn(t)∣.

Proof of Lemma 3. By Theorem 1.4.4 of Robertson, Wright and Dykstra (1988), for any t ∈ Tn,

θn*(t) = max_{U∈Ut} min_{L∈Lt} θn(U ∩ L) = min_{L∈Lt} max_{U∈Ut} θn(U ∩ L),

where, for any finite set S ⊆ Tn, θn(S) is defined as ∣S∣−1 Σ_{s∈S} θn(s). The sets U range over the collection Ut of upper sets of Tn containing t, where U ⊆ Tn is called an upper set if t1 ∈ U, t2 ∈ Tn and t1 ≤ t2 imply t2 ∈ U. The sets L range over the collection Lt of lower sets of Tn containing t, where L ⊆ Tn is called a lower set if t1 ∈ L, t2 ∈ Tn and t2 ≤ t1 imply t2 ∈ L.

Let Ut* := {s ∈ Tn : s ≥ t} and Lt* := {s ∈ Tn : s ≤ t}. First, suppose there exist L0 ∈ Lt and s0 ∈ L0 with s0 > t and ∥t − s0∥ > κn. Then, we claim that there exists another lower set L0′ ∈ Lt such that θn(Ut* ∩ L0) > θn(Ut* ∩ L0′). If θn(Ut* ∩ L0) > θn(t) = θn(Ut* ∩ Lt*), then L0′ = Lt* satisfies the claim. Otherwise, if θn(Ut* ∩ L0) ≤ θn(t), let

L0′ := L0 \ {s : s > t, ∥t − s∥ > κn}.

One can verify that L0′ ∈ Lt, and since s0 ∈ L0 \ L0′, L0′ is a strict subset of L0. Furthermore, by the definition of κn, θn(s) > θn(t) for all s > t such that ∥t − s∥ > κn, and since θn(Ut* ∩ L0) ≤ θn(t), removing these elements from L0 can only reduce the average, so that θn(Ut* ∩ L0′) < θn(Ut* ∩ L0). This establishes the claim. By an analogous argument, we can show that if there exist U0 ∈ Ut and s0 ∈ U0 with s0 < t and ∥t − s0∥ > κn, then there exists another upper set U0′ ∈ Ut such that θn(U0′ ∩ Lt*) > θn(U0 ∩ Lt*).

Let L* ∈ argmin_{L∈Lt} θn(Ut* ∩ L) and U* ∈ argmax_{U∈Ut} θn(U ∩ Lt*). Then,

θn*(t) = max_{U∈Ut} min_{L∈Lt} θn(U ∩ L) ≥ min_{L∈Lt} θn(Ut* ∩ L) = θn(Ut* ∩ L*)  and
θn*(t) = min_{L∈Lt} max_{U∈Ut} θn(U ∩ L) ≤ max_{U∈Ut} θn(U ∩ Lt*) = θn(U* ∩ Lt*).

Hence, θn(Ut* ∩ L*) ≤ θn*(t) ≤ θn(U* ∩ Lt*). By the above argument, we have that both

θn(Ut* ∩ L*) ≥ inf{θn(s) : s ≥ t, ∥t − s∥ ≤ κn}  and  θn(U* ∩ Lt*) ≤ sup{θn(s) : s ≤ t, ∥t − s∥ ≤ κn}.

Therefore, we find that

inf{θn(s) − θn(t) : ∥t − s∥ ≤ κn} ≤ θn*(t) − θn(t) ≤ sup{θn(s) − θn(t) : ∥t − s∥ ≤ κn},

and thus ∣θn*(t) − θn(t)∣ ≤ sup{∣θn(s) − θn(t)∣ : ∥t − s∥ ≤ κn}. Taking the maximum over t ∈ Tn yields the claim. □

The proof of Theorem 2 follows easily from Lemmas 1, 2 and 3.

Proof of Theorem 2. By construction, for each t ∈ T, we can write

∣θn*(t) − θn(t)∣ ≤ Σj=12d λj,n(t)∣θn*(sj) − θn(sj)∣ + Σj=12d λj,n(t)∣θn(sj) − θn(t)∣,

where sj ∈ Tn and ∥sj − t∥ ≤ 2ωn for all t and sj by definition. Thus, since Σj λj,n(t) = 1, it follows that

sup_{t∈T} ∣θn*(t) − θn(t)∣ ≤ max_{t∈Tn} ∣θn*(t) − θn(t)∣ + sup_{∥s−t∥≤2ωn} ∣θn(s) − θn(t)∣.

By Lemma 3, the first summand is bounded above by sup_{∥s−t∥≤κn} ∣θn(s) − θn(t)∣, which is oP(rn−1) by Lemmas 1 and 2. The second summand is oP(rn−1) by Lemma 1. □

A.3. Proof of Corollary 1

We note that ℓn(t) ≤ θ0(t) ≤ un(t) if and only if

−γα(t) − {rn[un(t) − θn(t)] − γα(t)} ≤ rn[θn(t) − θ0(t)] ≤ γα(t) + {rn[θn(t) − ℓn(t)] − γα(t)}.

Therefore, by conditions (a)–(c), P0[ℓn(t) ≤ θ0(t) ≤ un(t) for all t ∈ T] → 1 − α. Next, we let δ > 0 and note that

sup_{∥t−s∥<δ/rn} ∣ rn{ℓn(t) − θ0(t)} − rn{ℓn(s) − θ0(s)} ∣ ≤ sup_{∥t−s∥<δ/rn} ∣ rn{θn(t) − θ0(t)} − rn{θn(s) − θ0(s)} ∣ + sup_{∥t−s∥<δ/rn} ∣γα(t) − γα(s)∣ + 2∥rn(θn − ℓn) − γα∥T.

The first term tends to zero in probability by (A), the second by the assumed uniform continuity of γα, and the third by conditions (a)–(c). An analogous decomposition holds for un. Therefore, we can apply Theorem 2 with ℓn and un in place of θn to find that ∥ℓn* − ℓn∥T = oP(rn−1) and ∥un* − un∥T = oP(rn−1). Finally, applying to the event ℓn* ≤ θ0 ≤ un* an argument analogous to that applied to ℓn ≤ θ0 ≤ un above yields the result. □

A.4. Proof of Theorem 3

Let ϵ, δ, η > 0. By (3.1) and since suptTRn,t=oP(n12),

n^{1/2}∣{θn(t) − θ0(t)} − {θn(s) − θ0(s)}∣ ≤ ∣Gn(ϕ0,t − ϕ0,s)∣ + oP(1),

Condition (A2) implies that {ϕ0,t:tT} is uniformly mean-square continuous, in the sense that

lim_{h→0} sup_{∥t−s∥≤h} ∫ {ϕ0,s(x) − ϕ0,t(x)}² dP0(x) = 0.

Since T is totally bounded in ∥·∥, this also implies that {ϕ0,t : t ∈ T} is totally bounded in the L2(P0) metric. This, in addition to (A1), implies that {Gnϕ0,t : t ∈ T} converges weakly in ℓ∞(T) to a Gaussian process G with covariance function Σ0. Furthermore, (A2) implies that this limit process is a tight element of ℓ∞(T). By Theorem 1.5.4 of van der Vaart and Wellner (1996), {Gnϕ0,t : t ∈ T} is asymptotically tight. By Theorem 1.5.7 of van der Vaart and Wellner (1996), {Gnϕ0,t : t ∈ T} is thus asymptotically uniformly mean-square equicontinuous in probability, in the sense that there exists some δ0 = δ0(ϵ, η) > 0 such that

lim sup_n P0[ sup_{ρ(s,t)<δ0} ∣Gn(ϕ0,t − ϕ0,s)∣ > ϵ ] < η

with ρ(s, t) := [∫ {ϕ0,t(x) − ϕ0,s(x)}² dP0(x)]^{1/2}. By (A2), sup_{∥t−s∥≤h} ρ(t, s) < δ0 for some h > 0. Hence, for all n large, both δn^{-1/2} ≤ h and

P0[ sup_{ρ(s,t)<δ0} ∣Gn(ϕ0,t − ϕ0,s)∣ > ϵ ] < η,

so that

P0[ sup_{∥t−s∥≤δn^{-1/2}} ∣Gn(ϕ0,t − ϕ0,s)∣ > ϵ ] ≤ P0[ sup_{ρ(s,t)<δ0} ∣Gn(ϕ0,t − ϕ0,s)∣ > ϵ ] < η

and the proof is complete. □

A.5. Proof of Propositions 1 and 2

Below, we refer to van der Vaart and Wellner (1996) as VW. Throughout, the symbol ≲ should be interpreted to mean ‘bounded above, up to a multiplicative constant not depending on n, t, y or a.’

We first note that condition (M) of Stupfler (2016) guarantees that the class

{x ↦ K((x − t)/h) : h > 0, t ∈ ℝ}

is Vapnik–Chervonenkis (henceforth VC) with index 2. In addition, we define Kj := ∫ u^j K(u) du and

wn(a, t2) := s2,n(t2) − s1,n(t2)(a − t2),  w0(a, t2) := f0(t2) − f0′(t2)(a − t2).
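For orientation, the following R sketch (our addition) shows how wn and gn combine to produce the local linear estimator of θ0(t1, t2) studied in Propositions 1 and 2; it assumes, consistently with the displays below, that sj,n(t2) = ∫ hn^{-1}(a − t2)^j K((a − t2)/hn) Pn(da) and gn = s0,n s2,n − s1,n².

K <- function(u) 0.75 * pmax(1 - u^2, 0)   # Epanechnikov kernel, supported on [-1, 1]
loclin_cdf <- function(t1, t2, Y, A, h) {
  Kh <- K((A - t2) / h) / h                # kernel weights
  s0 <- mean(Kh)                           # s_{0,n}(t2)
  s1 <- mean((A - t2)   * Kh)              # s_{1,n}(t2)
  s2 <- mean((A - t2)^2 * Kh)              # s_{2,n}(t2)
  w  <- s2 - s1 * (A - t2)                 # w_n(a, t2) as defined above
  g  <- s0 * s2 - s1^2                     # g_n(t2)
  mean((Y <= t1) * w * Kh) / g             # local linear estimate of theta_0(t1, t2)
}
# Example call: loclin_cdf(0.5, 0.5, Y = runif(100), A = runif(100), h = 0.2)

Because the weights satisfy mean(w * Kh) = gn by construction, the estimator reproduces constants exactly; this is the algebraic identity behind the decomposition θn − θ0 = m1,n + m2,n used in the proof of Proposition 1.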

Before proving Propositions 1 and 2, we state and prove a lemma we will use.

Lemma 4. If conditions (d)–(f) hold, nhn^4 → ∞ and nhn^5 = O(1), then

(nhn^5)^{1/2} sup_{t∈T} ∣s1,n(t2)/gn(t2) − f0′(t2)/f0(t2)²∣ →P0 0,
(nhn^5)^{1/2} sup_{t∈T} ∣s2,n(t2)/gn(t2) − 1/f0(t2)∣ →P0 0,
(nhn^5)^{1/2} sup_{t∈T} sup_{∣a−t2∣≤hn} ∣wn(a, t2)/gn(t2) − w0(a, t2)/f0(t2)²∣ →P0 0,

and for any δ > 0,

(nhn^5)^{1/2} sup_{∥t−s∥≤δ(nhn)^{-1/2}} ∣s1,n(t2)/gn(t2) − s1,n(s2)/gn(s2)∣ →P0 0,
(nhn^5)^{1/2} sup_{∥t−s∥≤δ(nhn)^{-1/2}} ∣s2,n(t2)/gn(t2) − s2,n(s2)/gn(s2)∣ →P0 0.

Proof of Lemma 4. We first show that sup_{t∈T} ∣s0,n(t2) − f0(t2)∣ = oP(hn). We have that

s0,n(t2) − f0(t2) = ∫ hn^{-1}K((a − t2)/hn) f0(a) da − f0(t2) + n^{-1/2} hn^{-1} Gn K((· − t2)/hn).

By the change of variables u = (a − t2)/hn, we have that

∫ hn^{-1}K((a − t2)/hn) f0(a) da − f0(t2) = ∫ K(u)[f0(t2 + hnu) − f0(t2)] du = hn ∫ uK(u) (hnu)^{-1} Rf(1)(t2, hnu) du,

which, in view of the assumed uniform negligibility of Rf(1), tends to zero uniformly over t2 faster than hn. For the second term, since K is uniformly bounded and the class

{a ↦ K((a − t2)/hn) : t2 ∈ [0, 1]}

is P0-Donsker, as implied by condition (M) of Stupfler (2016), Theorem 2.14.1 of VW implies that

sup_{t2} ∣Gn K((· − t2)/hn)∣ = OP(1).

Then, since n^{-1/2}hn^{-1} = hn(nhn^4)^{-1/2} = o(hn), this term is also oP(hn).

We next show that (nhn^5)^{1/2} sup_{t∈T} ∣hn^{-2}s1,n(t2) − f0′(t2)K2∣ = oP(1). We have that

(nhn)^{1/2} s1,n(t2) = (nhn^{-1})^{1/2} ∫ (a − t2)K((a − t2)/hn) f0(a) da + hn^{-1/2} ∫ (a − t2)K((a − t2)/hn) Gn(dy, da).

By the change of variables u = (a − t2)/hn, the first term equals

(nhn³)^{1/2} ∫ uK(u) f0(t2 + hnu) du = (nhn³)^{1/2} ∫ uK(u)[f0(t2 + hnu) − f0(t2) − (hnu)f0′(t2)] du + (nhn^5)^{1/2} f0′(t2)K2 = (nhn^5)^{1/2} ∫ uK(u) hn^{-1} Rf(1)(t2, hnu) du + (nhn^5)^{1/2} f0′(t2)K2.

By the assumed uniform negligibility of Rf(1) and since hn = O(n^{-1/5}), the first term tends to zero uniformly over t ∈ T.

Turning to the second term in s1,n, we will apply Theorem 2.14.1 of VW to obtain a tail bound for the supremum of this empirical process over the one-dimensional class indexed by t2. We note that, since K is bounded by some K̄ and supported on [−1, 1],

∣(a − t2)K((a − t2)/hn)∣ ≤ K̄∣a − t2∣ I(∣a − t2∣ ≤ hn) ≤ K̄hn.

Therefore, the class of functions

{(y, a) ↦ (a − t2)K((a − t2)/hn) : (t1, t2) ∈ T}

has envelope K̄hn. Furthermore, since {(y, a) ↦ a − t2 : t2 ∈ [0, 1]} and {(y, a) ↦ K((a − t2)/hn) : t2 ∈ [0, 1]} are both uniformly bounded VC classes of functions, the product class possesses a finite uniform entropy integral. Hence, we have that

E0[ sup_{(t1,t2)∈T} ∣ hn^{-1/2} ∫ (a − t2)K((a − t2)/hn) Gn(dy, da) ∣ ] ≤ C hn^{1/2} → 0.

We now have that (nhn)^{1/2} sup_{t∈T} ∣s1,n(t2) − hn²f0′(t2)K2∣ = oP(1), which implies in particular that

sup_{t∈T} ∣s1,n(t2)∣ = (nhn)^{-1/2} oP(1) + hn² OP(1) = OP((nhn)^{-1/2}).

Next, we show that (nhn^5)^{1/2} sup_{t∈T} ∣hn^{-2}s2,n(t2) − f0(t2)K2∣ = oP(hn). The proof of this is nearly identical to the preceding proof. We have that

(nhn)^{1/2} s2,n(t2) = (nhn^{-1})^{1/2} ∫ (a − t2)²K((a − t2)/hn) f0(a) da + hn^{-1/2} ∫ (a − t2)²K((a − t2)/hn) Gn(dy, da).

By the change of variables u = (a − t2)/hn, the first term equals

(nhn^5)^{1/2} ∫ u²K(u) f0(t2 + hnu) du = (nhn^5)^{1/2} hn ∫ u³K(u) (hnu)^{-1} Rf(1)(t2, hnu) du + (nhn^5)^{1/2} f0(t2)K2.

By the uniform negligibility of Rf(1), the first term is oP(hn) uniformly in t.

Analysis of the second term in s2,n is analogous to that of s1,n, except that the envelope function is now K̄hn², so that the empirical process term is OP(hn^{3/2}). We also note that sup_{t2} ∣s2,n(t2)∣ = OP((nhn)^{-1/2}).

The above derivations imply that

(nhn^5)^{1/2} sup_{t∈T} ∣hn^{-2}gn(t2) − f0(t2)²K2∣ ≤ (nhn^5)^{1/2} sup_{t∈T} {∣hn^{-2}s2,n(t2) − f0(t2)K2∣ s0,n(t2)} + (nhn^5)^{1/2} sup_{t∈T} {∣s0,n(t2) − f0(t2)∣ f0(t2)K2} + (nhn)^{1/2} [sup_{t∈T} ∣s1,n(t2)∣]² = oP(1)OP(1) + (nhn^5)^{1/2}oP(hn) + (nhn)^{1/2}OP((nhn)^{-1}) = oP(1).

We now proceed to the statements in the lemma. We write that

∣s1,n(t2)/gn(t2) − f0′(t2)/f0(t2)²∣ = ∣ hn^{-2}s1,n(t2)/(hn^{-2}gn(t2)) − f0′(t2)K2/(f0(t2)²K2) ∣ ≤ ∣hn^{-2}s1,n(t2) − f0′(t2)K2∣ / (hn^{-2}gn(t2)) + f0′(t2)K2 ∣hn^{-2}gn(t2) − f0(t2)²K2∣ / [(hn^{-2}gn(t2)) f0(t2)²K2].

Since inf_{t∈T} f0(t2) > 0, we have that

sup_{t∈T} [hn^{-2}gn(t2)]^{-1} = OP(1)  and  sup_{t∈T} [hn^{-2}gn(t2)f0(t2)²]^{-1} = OP(1),

and the result follows.

We omit the proof of the statement about s2,n, since it is almost identical to the above. For the statement about wn, by the above calculations, we have that

(nhn^5)^{1/2} sup_{(t1,t2)∈T} sup_{∣a−t2∣≤hn} ∣hn^{-2}wn(a, t2) − w0(a, t2)K2∣ →P0 0.

We write that

∣wn(a, t2)/gn(t2) − w0(a, t2)/f0(t2)²∣ = ∣ hn^{-2}wn(a, t2)/(hn^{-2}gn(t2)) − w0(a, t2)K2/(f0(t2)²K2) ∣ ≤ [hn^{-2}gn(t2)]^{-1} ∣hn^{-2}wn(a, t2) − w0(a, t2)K2∣ + ∣w0(a, t2)∣ [hn^{-2}gn(t2)f0(t2)²]^{-1} ∣hn^{-2}gn(t2) − f0(t2)²K2∣,

and the result follows.

We note that the above implies that

sup_{∣t2−s2∣≤η} ∣s1,n(t2) − s1,n(s2)∣ ≤ 2 sup_{t2} ∣s1,n(t2) − hn²f0′(t2)K2∣ + hn² sup_{∣t2−s2∣≤η} ∣f0′(t2) − f0′(s2)∣K2 ≲ oP((nhn)^{-1/2}) + hn²η,

so that

sup_{∣t2−s2∣≤δ(nhn)^{-1/2}} ∣s1,n(t2) − s1,n(s2)∣ = oP((nhn)^{-1/2}).

Similarly, we have that sup_{∣t2−s2∣≤δ(nhn)^{-1/2}} ∣s0,n(t2) − s0,n(s2)∣ = oP(hn) and

sup_{∣t2−s2∣≤δ(nhn)^{-1/2}} ∣s2,n(t2) − s2,n(s2)∣ = oP(hn(nhn)^{-1/2}).

Therefore, we find that

sup_{∥t−s∥≤δ(nhn)^{-1/2}} ∣gn(t2) − gn(s2)∣ ≤ sup_{∥t−s∥≤δ(nhn)^{-1/2}} ∣[s0,n(t2) − s0,n(s2)] s2,n(s2)∣ + sup_{∥t−s∥≤δ(nhn)^{-1/2}} ∣s0,n(t2)[s2,n(t2) − s2,n(s2)]∣ + sup_{∥t−s∥≤δ(nhn)^{-1/2}} ∣[s1,n(t2) + s1,n(s2)][s1,n(t2) − s1,n(s2)]∣ ≲ oP(hn)OP((nhn)^{-1/2}) + OP(1)oP(hn(nhn)^{-1/2}) + OP((nhn)^{-1/2})oP((nhn)^{-1/2}) = oP(hn(nhn)^{-1/2}).

We can now write that

sup_{∥t−s∥≤δ(nhn)^{-1/2}} ∣s1,n(t2)/gn(t2) − s1,n(s2)/gn(s2)∣ ≤ hn^{-2} sup_{∥t−s∥≤δ(nhn)^{-1/2}} ∣s1,n(t2) − s1,n(s2)∣ / (hn^{-2}gn(t2)) + hn^{-4} sup_{∥t−s∥≤δ(nhn)^{-1/2}} ∣s1,n(s2)∣ ∣gn(t2) − gn(s2)∣ / [(hn^{-2}gn(t2))(hn^{-2}gn(s2))] = hn^{-2} oP((nhn)^{-1/2}) + hn^{-4} OP((nhn)^{-1/2}) oP(hn(nhn)^{-1/2}) = oP((nhn^5)^{-1/2})

and

sup_{∥t−s∥≤δ(nhn)^{-1/2}} ∣s2,n(t2)/gn(t2) − s2,n(s2)/gn(s2)∣ ≤ hn^{-2} sup_{∥t−s∥≤δ(nhn)^{-1/2}} ∣s2,n(t2) − s2,n(s2)∣ / (hn^{-2}gn(t2)) + hn^{-4} sup_{∥t−s∥≤δ(nhn)^{-1/2}} ∣s2,n(s2)∣ ∣gn(t2) − gn(s2)∣ / [(hn^{-2}gn(t2))(hn^{-2}gn(s2))] = hn^{-2} oP(hn(nhn)^{-1/2}) + hn^{-4} OP((nhn)^{-1/2}) oP(hn(nhn)^{-1/2}) = oP((nhn^4)^{-1}).

We can now prove Proposition 1.

Proof of Proposition 1. We define

m1,n(t1, t2) := ∫ hn^{-1}[θ0(t1, a) − θ0(t1, t2)] [wn(a, t2)/gn(t2)] K((a − t2)/hn) Pn(dy, da),
m2,n(t1, t2) := ∫ hn^{-1}[I(y ≤ t1) − θ0(t1, a)] [wn(a, t2)/gn(t2)] K((a − t2)/hn) Pn(dy, da).

Then, we have that θn(t1, t2) − θ0(t1, t2) = m1,n(t1, t2) + m2,n(t1, t2). We note that, since E0[I(Y ≤ t1) ∣ A = a] = θ0(t1, a), (nhn)^{1/2} m2,n(t1, t2) equals

hn^{-1/2} ∫ [I(y ≤ t1) − θ0(t1, a)] [wn(a, t2)/gn(t2)] K((a − t2)/hn) Gn(dy, da) = hn^{-1/2}[ (s2,n(t2)/gn(t2)) Gn νn,t − (s1,n(t2)/gn(t2)) Gn(ℓt νn,t) ].

Therefore, we can write that

(nhn)^{1/2}[θn(t1, t2) − θ0(t1, t2)] − (nhn^5)^{1/2} (1/2)θ0,t2″(t1, t2)K2 − Rn(t1, t2) = (nhn)^{1/2} m1,n(t1, t2) − (nhn^5)^{1/2} (1/2)θ0,t2″(t1, t2)K2.

We now proceed to analyze m1,n. We have that (nhn)1/2 m1,n(t1, t2) equals

(nhn^{-1})^{1/2} ∫ [θ0(t1, a) − θ0(t1, t2)] [wn(a, t2)/gn(t2)] K((a − t2)/hn) f0(a) da + hn^{-1/2} ∫ [θ0(t1, a) − θ0(t1, t2)] [wn(a, t2)/gn(t2)] K((a − t2)/hn) Gn(dy, da).

The second term in m1,n may be further decomposed as

hn^{-1/2} (s2,n(t2)/gn(t2)) Gn γt,n − hn^{-1/2} (s1,n(t2)/gn(t2)) Gn(ℓt γt,n)

with γt,n(y, a) := [θ0(t1, a) − θ0(t1, t2)]K((a − t2)/hn) and ℓt(y, a) := a − t2. By Lemma 4, we have that

sup_{t∈T} ∣s2,n(t2)/gn(t2)∣ = OP((nhn^5)^{-1/2}),

and similarly for s1,n. We will use Theorem 2.14.2 of VW to obtain bounds for sup_{t∈T} ∣Gnγt,n∣ and sup_{t∈T} ∣Gn(ℓtγt,n)∣. We first note that, since K is bounded and supported on [−1, 1] and θ0 is Lipschitz on T, sup_{t∈T} ∣γt,n∣ ≲ hn and sup_{t∈T} ∣ℓtγt,n∣ ≲ hn². These will be our envelope functions for these classes. Next, since K is Lipschitz, we have that

∣γt,n − γs,n∣ ≤ ∣[θ0(t1, a) − θ0(s1, a)] − [θ0(t1, t2) − θ0(s1, s2)]∣ ∣K((a − s2)/hn)∣ + ∣θ0(t1, a) − θ0(t1, t2)∣ ∣K((a − t2)/hn) − K((a − s2)/hn)∣ ≲ ∣t1 − s1∣ + ∥t − s∥ + ∣t2 − s2∣hn^{-1} ≲ ∥t − s∥hn^{-1}.

Therefore, by VW Theorem 2.7.11, we have

N[](2εhn^{-1}, 𝒢n, L2(P0)) ≤ N(ε, T, ∥·∥) ≲ ε^{-2},

where 𝒢n := {γt,n : t ∈ T}. Thus, by Theorem 2.14.2 of VW,

E0 sup_{t∈T} ∣Gnγt,n∣ ≲ hn ∫_0^1 [log N[](εhn, 𝒢n, L2(P0))]^{1/2} dε ≲ hn ∫_0^1 [log(ε^{-1}hn^{-2})]^{1/2} dε = hn^{-1} ∫_0^{hn²} (log ε^{-1})^{1/2} dε ≲ hn^{-1} {hn²[log(hn^{-2})]^{1/2}} ≲ hn(log hn^{-1})^{1/2},

where we have used the fact that ∫_0^z (log x^{-1})^{1/2} dx ≲ z(log z^{-1})^{1/2} for all z small enough. A similar argument applies to sup_{t∈T} ∣Gn(ℓtγt,n)∣. We thus have that the second summand in m1,n is bounded above, up to a constant not depending on n and uniformly in t, by

hn^{-1/2} OP((nhn^5)^{-1/2}) hn(log hn^{-1})^{1/2} = OP((nhn^4/log hn^{-1})^{-1/2}),

which is oP(1) since nhn^4/(log hn^{-1}) → ∞.
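As an aside, the calculus fact invoked after the entropy bound above can be verified directly (our sketch): substituting x = e^{-u} gives

∫_0^z (log x^{-1})^{1/2} dx = ∫_{log z^{-1}}^∞ u^{1/2} e^{-u} du ∼ (log z^{-1})^{1/2} e^{-log z^{-1}} = z(log z^{-1})^{1/2}  as z ↓ 0,

where the middle step is the standard tail asymptotic ∫_a^∞ u^{1/2}e^{-u} du ∼ a^{1/2}e^{-a} of the upper incomplete gamma function.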

By the change of variables u = (a − t2)/hn, the first term in m1,n equals

(nhn)^{1/2} ∫ [θ0(t1, t2 + hnu) − θ0(t1, t2)] [wn(t2 + hnu, t2)/gn(t2)] K(u) f0(t2 + hnu) du = (nhn)^{1/2} ∫ [Rθ(2)(t, hnu) + θ0,t2′(t1, t2)(hnu) + (1/2)θ0,t2″(t1, t2)(hnu)²][Rf(1)(t2, hnu) + f0(t2) + f0′(t2)(hnu)] [wn(t2 + hnu, t2)/gn(t2)] K(u) du.

Expanding the product, this is equal to

(nhn)^{1/2} ∫ (hnu)[θ0,t2′(t1, t2) + (1/2)θ0,t2″(t1, t2)(hnu)][f0(t2) + f0′(t2)(hnu)] [wn(t2 + hnu, t2)/gn(t2)] K(u) du
+ (nhn)^{1/2} ∫ Rθ(2)(t, hnu)[f0(t2) + f0′(t2)(hnu)] [wn(t2 + hnu, t2)/gn(t2)] K(u) du
+ (nhn)^{1/2} ∫ Rf(1)(t2, hnu)(hnu)[θ0,t2′(t1, t2) + (1/2)θ0,t2″(t1, t2)(hnu)] [wn(t2 + hnu, t2)/gn(t2)] K(u) du
+ (nhn)^{1/2} ∫ Rθ(2)(t, hnu) Rf(1)(t2, hnu) [wn(t2 + hnu, t2)/gn(t2)] K(u) du.

By the assumed negligibility of Rθ(2) and Rf(1) as well as Lemma 4, the second through fourth summands tend to zero in probability uniformly over T. The first term equals

f0′(t2) ∫ [θ0,t2′(t1, t2) + (1/2)θ0,t2″(t1, t2)(hnu)] (nhn^5)^{1/2} [wn(t2 + hnu, t2)/gn(t2) − w0(t2 + hnu, t2)/f0(t2)²] u²K(u) du
+ (nhn^5)^{1/2} f0′(t2) ∫ [θ0,t2′(t1, t2) + (1/2)θ0,t2″(t1, t2)(hnu)] [w0(t2 + hnu, t2)/f0(t2)²] u²K(u) du
+ (nhn³)^{1/2} f0(t2) ∫ [θ0,t2′(t1, t2) + (1/2)θ0,t2″(t1, t2)(hnu)] [wn(t2 + hnu, t2)/gn(t2)] uK(u) du.

By Lemma 4, the first summand tends to zero uniformly over T. By symmetry of K, the sum of the second and third summands simplifies to

(nhn^5)^{1/2} (1/2)θ0,t2″(t1, t2)K2 + (nhn^5)^{1/2} [s2,n(t2)/gn(t2) − 1/f0(t2)] f0(t2) (1/2)θ0,t2″(t1, t2)K2 − (nhn^5)^{1/2} [s1,n(t2)/gn(t2) − f0′(t2)/f0(t2)²] θ0,t2′(t1, t2) f0(t2)K2.

Once again, the second and third summands tend to zero uniformly over T by Lemma 4. We have now shown that

sup_{t∈T} ∣(nhn)^{1/2} m1,n(t1, t2) − (nhn^5)^{1/2} (1/2)θ0,t2″(t1, t2)K2∣ →P0 0,

which completes the proof. □

Finally, we prove Proposition 2.

Proof of Proposition 2. Since θ0,t2″ is uniformly continuous and nhn^5 = O(1),

sup_{∥t−s∥≤δ/rn} ∣(nhn^5)^{1/2} (1/2)θ0,t2″(t1, t2)K2 − (nhn^5)^{1/2} (1/2)θ0,t2″(s1, s2)K2∣ → 0.

Therefore, it only remains to show that sup_{∥t−s∥≤δ/rn} ∣Rn(t) − Rn(s)∣ →P0 0. Recalling that ℓt(y, a) := a − t2 and

νn,t(y, a) := [I(y ≤ t1) − θ0(t1, a)]K((a − t2)/hn),

we have that Rn(t) − Rn(s) equals

hn^{-1/2} [s2,n(t2)/gn(t2) − s2,n(s2)/gn(s2)] Gn νn,t − hn^{-1/2} [s1,n(t2)/gn(t2) − s1,n(s2)/gn(s2)] Gn(ℓt νn,t) + hn^{-1/2} (s2,n(s2)/gn(s2)) Gn(νn,t − νn,s) − hn^{-1/2} (s1,n(s2)/gn(s2)) Gn(ℓt νn,t − ℓs νn,s).

Focusing first on Gnνn,t, we have Gnνn,t = Gnνn,t,1 − Gnνn,t,2 for

νn,t,1(y, a) := I(y ≤ t1)K((a − t2)/hn)  and  νn,t,2(y, a) := θ0(t1, a)K((a − t2)/hn).

The classes

{(y, a) ↦ I(y ≤ t1) : t ∈ T}  and  {(y, a) ↦ K((a − t2)/hn) : t ∈ T}

are both uniformly bounded above and VC. Therefore, the uniform covering numbers of the class

{(y, a) ↦ I(y ≤ t1)K((a − t2)/hn) : t ∈ T}

are bounded up to a constant by ε^{-V} for some V < ∞, so that the uniform entropy integral satisfies

J(η, 𝒢n,1) ≲ η(log η^{-1})^{1/2}

for all η small enough, where 𝒢n,1 := {νn,t,1 : t ∈ T}. We also have P0(νn,t,1)² ≲ hn for all t ∈ T and all n large enough. Thus, Theorem 2.1 of van der Vaart and Wellner (2011) implies that

EP0 sup_{t∈T} ∣Gn νn,t,1∣ ≲ hn^{1/2}(log hn^{-1})^{1/2} + n^{-1/2} log hn^{-1}.

For Gnνn,t,2, we have that

∣νn,t,2(y, a) − νn,s,2(y, a)∣ ≲ ∥t − s∥(1 + hn^{-1}) ≲ hn^{-1}∥t − s∥

for all n large enough and all (y, a). We can therefore apply Theorem 2.7.11 of VW to conclude that N[](2εhn^{-1}, 𝒢n,2, L2(P0)) ≲ ε^{-2} for all ε small enough, where 𝒢n,2 := {νn,t,2 : t ∈ T}, which implies that N[](ε, 𝒢n,2, L2(P0)) ≲ (εhn)^{-2}. Thus, we have that

J[](η, 𝒢n,2) ≲ η[log((ηhn)^{-1})]^{1/2}.

Since P0(νn,t,2)² ≲ hn as well, by Lemma 3.4.2 of VW, we then have

EP0 sup_{t∈T} ∣Gn νn,t,2∣ ≲ hn^{1/2}(log hn^{-1})^{1/2} + n^{-1/2} log hn^{-1}.

Combining these two bounds with the last statement of Lemma 4 yields

hn^{-1/2} sup_{∥t−s∥≤δ/rn} ∣[s2,n(t2)/gn(t2) − s2,n(s2)/gn(s2)] Gn νn,t∣ ≤ hn^{-1/2} oP((nhn^4)^{-1}) OP(hn^{1/2}(log hn^{-1})^{1/2} + n^{-1/2} log hn^{-1}) = oP(1)[nhn^4(log hn^{-1})^{-1/2}]^{-1} + oP(1)(nhn³)^{-3/2} log hn^{-1}.

Both terms tend to zero.

The analysis for Gn(ℓtνn,t) is very similar. In this case, we have P0(ℓtνn,t)² ≲ hn³, so that, using the same approach as above, we get

EP0 sup_{t∈T} ∣Gn(ℓt νn,t)∣ ≲ hn^{3/2}(log hn^{-1})^{1/2} + n^{-1/2} log hn^{-1}

and therefore, in view of Lemma 4,

hn^{-1/2} sup_{∥t−s∥≤δ/rn} ∣[s1,n(t2)/gn(t2) − s1,n(s2)/gn(s2)] Gn(ℓt νn,t)∣ ≤ hn^{-1/2} oP((nhn^5)^{-1/2}) OP(hn^{3/2}(log hn^{-1})^{1/2} + n^{-1/2} log hn^{-1}) = oP(1)(nhn^4)^{-1/2}(hn log hn^{-1})^{1/2} + oP(1)(nhn³)^{-1} log hn^{-1},

which goes to zero in probability.

It remains to bound

sup_{∥t−s∥≤δ/rn} ∣Gn(νn,t − νn,s)∣  and  sup_{∥t−s∥≤δ/rn} ∣Gn(ℓt νn,t − ℓs νn,s)∣.

For the former, we work on the terms Gn(νn,t,1 − νn,s,1) and Gn(νn,t,2 − νn,s,2) separately. For the first of these, we let ℱn,δ,1 := {νn,t,1 − νn,s,1 : ∥t − s∥ ≤ δ/rn}. We have that ∥νn,t,1 − νn,s,1∥_{P0,2} is bounded above by

[EP0{[I(Y ≤ t1) − I(Y ≤ s1)]² K((A − s2)/hn)²}]^{1/2} + [EP0{I(Y ≤ t1)²[K((A − t2)/hn) − K((A − s2)/hn)]²}]^{1/2} ≲ (EP0{I(s1 < Y ≤ t1) I(∣A − s2∣ ≤ hn)})^{1/2} + hn^{-1}∣t2 − s2∣ ≲ hn^{1/2}∣t1 − s1∣^{1/2} + hn^{-1}∣t2 − s2∣.

Therefore, it follows that

sup_{f∈ℱn,δ,1} (P0 f²)^{1/2} ≲ (nhn^{-1})^{-1/4} + (nhn³)^{-1/2} ≲ (nhn³)^{-1/2}

for all n large enough. In addition, ℱn,δ,1 has uniform covering numbers bounded up to a constant by ε^{-V} for all n and δ because the classes {(y, a) ↦ I(y ≤ t1) : t ∈ T} and

{(y, a) ↦ K((a − t2)/hn) : t ∈ T}

are VC. Therefore, J(η, ℱn,δ,1) ≲ η(log η^{-1})^{1/2} for all η small enough. Thus, Theorem 2.1 of van der Vaart and Wellner (2011) implies that

EP0 sup_{∥t−s∥≤δ/rn} ∣Gn(νn,t,1 − νn,s,1)∣ ≲ [log(nhn³)/(nhn³)]^{1/2} + n^{-1/2} log(nhn³).

Turning to Gn(νn,t,2 − νn,s,2), we analogously define

ℱn,δ,2 := {νn,t,2 − νn,s,2 : ∥t − s∥ ≤ δ/rn}.

By the Lipschitz property of θ0 and K, we have that

∣θ0(t1, a)K((a − t2)/hn) − θ0(s1, a)K((a − s2)/hn)∣ ≲ hn^{-1}∥t − s∥.

Therefore, up to a constant, an envelope function Fn,δ,2 for ℱn,δ,2 is given by hn^{-1}δ/rn ≲ (nhn³)^{-1/2}. Next, we have, for any (t, s) and (t′, s′) in T²,

∣[θ0(t1, a)K((a − t2)/hn) − θ0(s1, a)K((a − s2)/hn)] − [θ0(t1′, a)K((a − t2′)/hn) − θ0(s1′, a)K((a − s2′)/hn)]∣ ≤ ∣θ0(t1, a) − θ0(t1′, a)∣ K((a − t2)/hn) + θ0(t1′, a)∣K((a − t2)/hn) − K((a − t2′)/hn)∣ + ∣θ0(s1, a) − θ0(s1′, a)∣ K((a − s2)/hn) + θ0(s1′, a)∣K((a − s2)/hn) − K((a − s2′)/hn)∣ ≲ ∣t1 − t1′∣ + hn^{-1}∣t2 − t2′∣ + ∣s1 − s1′∣ + hn^{-1}∣s2 − s2′∣ ≲ hn^{-1}∥(t, s) − (t′, s′)∥_{T²}

with ∥(t, s) − (t′, s′)∥_{T²} := max{∥t − t′∥, ∥s − s′∥}. Therefore, by Theorem 2.7.11 of VW, we have that

N[](2εhn^{-1}, ℱn,δ,2, L2(P0)) ≤ N(ε, Uδ/rn, ∥·∥_{T²}),

where Uδ/rn := {(t, s) ∈ T² : ∥t − s∥ ≤ δ/rn}. Since Uδ/rn ⊆ T², we trivially have that N(ε, Uδ/rn, ∥·∥_{T²}) ≲ ε^{-4}. Thus, it follows that

N[](ε(nhn³)^{-1/2}, ℱn,δ,2, L2(P0)) ≲ [ε(nhn)^{-1/2}]^{-4}.

Therefore, Theorem 2.14.2 of VW implies that

EP0 sup_{∥t−s∥≤δ/rn} ∣Gn(νn,t,2 − νn,s,2)∣ ≲ (nhn³)^{-1/2} ∫_0^1 {log([ε(nhn)^{-1/2}]^{-1})}^{1/2} dε = (nhn³)^{-1/2} (nhn)^{1/2} ∫_0^{(nhn)^{-1/2}} (log u^{-1})^{1/2} du ≲ (nhn³)^{-1/2}[log(nhn)]^{1/2}.

We now have that

hn^{-1/2} sup_{∥t−s∥≤δ/rn} ∣(s2,n(s2)/gn(s2)) Gn(νn,t − νn,s)∣ ≤ hn^{-1/2} OP((nhn^5)^{-1/2}) OP((nhn³)^{-1/2}[log(nhn³)]^{1/2} + n^{-1/2} log(nhn³)) = OP(1){(nhn^{9/2})^{-1}[log(nhn³)]^{1/2} + (nhn³)^{-1} log(nhn³)}.

Both terms tend to zero in probability.

We can address sup_{∥t−s∥≤δ/rn} ∣Gn(ℓt νn,t − ℓs νn,s)∣ in a very similar manner. As before, we work on the terms Gn(ℓt νn,t,1 − ℓs νn,s,1) and Gn(ℓt νn,t,2 − ℓs νn,s,2) separately. It is straightforward to see that the same line of reasoning as used above applies to each of these terms as well, yielding the same negligibility. □
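To connect Propositions 1 and 2 back to the correction procedure, the following end-to-end R sketch (our illustration, with simulated data) computes local linear estimates over a grid in (t1, t2) and then projects them onto the componentwise monotone cone; it assumes the Iso package of Turner (2015), cited in the references, whose biviso() performs bivariate isotonic regression.

library(Iso)                                      # assumed: provides biviso()
set.seed(1)
n <- 500
A <- runif(n)
Y <- plogis(-2 * A + rnorm(n))                    # theta_0(t1, a) increasing in t1 and a
h <- n^(-1/5)
K <- function(u) 0.75 * pmax(1 - u^2, 0)
loclin_cdf <- function(t1, t2) {                  # as in the earlier sketch
  Kh <- K((A - t2) / h) / h
  s1 <- mean((A - t2) * Kh); s2 <- mean((A - t2)^2 * Kh)
  mean((Y <= t1) * (s2 - s1 * (A - t2)) * Kh) / (mean(Kh) * s2 - s1^2)
}
t1s <- seq(0.1, 0.9, by = 0.1)
t2s <- seq(0.1, 0.9, by = 0.1)
est      <- outer(t1s, t2s, Vectorize(loclin_cdf))  # may violate monotonicity in finite samples
est_star <- biviso(est)                             # projection onto matrices monotone in rows and columns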

Contributor Information

Ted Westling, Department of Mathematics and Statistics, University of Massachusetts Amherst, Amherst, Massachusetts, USA.

Mark J. van der Laan, Division of Biostatistics, University of California, Berkeley, Berkeley, California, USA.

Marco Carone, Department of Biostatistics, University of Washington, Seattle, Washington, USA.

References

1. Ayer M, Brunk HD, Ewing GM, Reid WT and Silverman E (1955). An Empirical Distribution Function for Sampling with Incomplete Information. Ann. Math. Statist. 26 641–647.
2. Barlow RE, Bartholomew DJ, Bremner JM and Brunk HD (1972). Statistical Inference Under Order Restrictions: The Theory and Application of Isotonic Regression. Wiley, New York.
3. Bril G, Dykstra R, Pillers C and Robertson T (1984). Algorithm AS 206: Isotonic Regression in Two Independent Variables. J. R. Stat. Soc. Ser. C. Appl. Stat. 33 352–357.
4. Chernozhukov V, Fernández-Val I and Galichon A (2010). Quantile and Probability Curves Without Crossing. Econometrica 78 1093–1125.
5. Chernozhukov V, Fernández-Val I and Galichon A (2009). Improving point and interval estimators of monotone functions by rearrangement. Biometrika 96 559–575.
6. Daouia A and Park BU (2013). On Projection-type Estimators of Multivariate Isotonic Functions. Scandinavian Journal of Statistics 40 363–386.
7. Dette H, Neumeyer N and Pilz KF (2006). A simple nonparametric estimator of a strictly monotone regression function. Bernoulli 12 469–490.
8. Fan J and Gijbels I (1996). Local Polynomial Modelling and Its Applications. CRC Press, Boca Raton.
9. Fokianos K, Leucht A and Neumann MH (2017). On Integrated L1 Convergence Rate of an Isotonic Regression Estimator for Multivariate Observations. arXiv e-prints arXiv:1710.04813.
10. Gill RD and Robins JM (2001). Causal Inference for Complex Longitudinal Data: The Continuous Case. Ann. Statist. 29 1785–1811.
11. Hall P and Kang K-H (2001). Bootstrapping nonparametric density estimators with empirically chosen bandwidths. Ann. Statist. 29 1443–1468.
12. Härdle W, Janssen P and Serfling R (1988). Strong Uniform Consistency Rates for Estimators of Conditional Functionals. Ann. Statist. 16 1428–1449.
13. Kyng R, Rao A and Sachdeva S (2015). Fast, Provable Algorithms for Isotonic Regression in all ℓp-norms. In Advances in Neural Information Processing Systems 28 (Cortes C, Lawrence ND, Lee DD, Sugiyama M and Garnett R, eds.) 2719–2727. Curran Associates, Inc.
14. Liao X and Meyer MC (2014). coneproj: An R Package for the Primal or Dual Cone Projections with Routines for Constrained Regression. Journal of Statistical Software 61 1–22.
15. Meyer MC (1999). An extension of the mixed primal–dual bases algorithm to the case of more constraints than dimensions. Journal of Statistical Planning and Inference 81 13–31.
16. Mukerjee H and Stern S (1994). Feasible Nonparametric Estimation of Multiargument Monotone Functions. Journal of the American Statistical Association 89 77–80.
17. Patra RK and Sen B (2016). Estimation of a two-component mixture model with applications to multiple testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78 869–893.
18. Robertson T, Wright F and Dykstra R (1988). Order Restricted Statistical Inference. Wiley, New York.
19. Robins J (1986). A new approach to causal inference in mortality studies with a sustained exposure period – application to control of the healthy worker survivor effect. Mathematical Modelling 7 1393–1512.
20. Ruppert D, Sheather SJ and Wand MP (1995). An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association 90 1257–1270.
21. Stupfler G (2016). On the weak convergence of the kernel density estimator in the uniform topology. Electron. Commun. Probab. 21, 13 pp.
22. R Core Team (2018). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
23. Turner R (2015). Iso: Functions to Perform Isotonic Regression. R package version 0.0-17.
24. van der Laan MJ and Robins JM (2003). Unified Methods for Censored Longitudinal Data and Causality. Springer Science & Business Media.
25. van der Vaart AW and Wellner JA (1996). Weak Convergence and Empirical Processes. Springer-Verlag, New York.
26. van der Vaart A and Wellner JA (2011). A local maximal inequality under uniform entropy. Electron. J. Statist. 5 192–203.
27. Wand M (2015). KernSmooth: Functions for Kernel Smoothing Supporting Wand & Jones (1995). R package version 2.23-15.
28. Wand MP and Jones MC (1995). Kernel Smoothing. Chapman and Hall, London.
