Author manuscript; available in PMC: 2012 Mar 13.
Published in final edited form as: Electron J Stat. 2011;5(2011):192–203. doi: 10.1214/11-EJS605

A local maximal inequality under uniform entropy

Aad van der Vaart 1, Jon A Wellner 2,*
PMCID: PMC3299941  NIHMSID: NIHMS359099  PMID: 22423315

Abstract

We derive an upper bound for the mean of the supremum of the empirical process indexed by a class of functions that are known to have variance bounded by a small constant δ. The bound is expressed in the uniform entropy integral of the class at δ. The bound yields a rate of convergence of minimum contrast estimators when applied to the modulus of continuity of the contrast functions.

Keywords: Empirical process, modulus of continuity, minimum contrast estimator, rate of convergence

1. Introduction

The empirical measure $\mathbb{P}_n$ and empirical process $\mathbb{G}_n$ of a sample of observations $X_1,\dots,X_n$ from a probability measure $P$ on a measurable space $(\mathcal{X},\mathcal{A})$ attach to a given measurable function $f:\mathcal{X}\to\mathbb{R}$ the numbers

$$\mathbb{P}_nf=\frac1n\sum_{i=1}^nf(X_i),\qquad \mathbb{G}_nf=\frac{1}{\sqrt n}\sum_{i=1}^n\bigl(f(X_i)-Pf\bigr).$$
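As a concrete illustration (not taken from the paper), both quantities can be computed for a simulated sample; the indicator function and the uniform distribution below are hypothetical choices made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(size=n)  # X_1, ..., X_n drawn from P = Uniform(0, 1)

def empirical_measure(f):
    """P_n f = (1/n) * sum_i f(X_i)."""
    return f(x).mean()

def empirical_process(f, Pf):
    """G_n f = sqrt(n) * (P_n f - P f)."""
    return np.sqrt(n) * (empirical_measure(f) - Pf)

# f = indicator of [0, 1/2]; under the uniform distribution P f = 1/2.
f = lambda u: (u <= 0.5).astype(float)
print(empirical_measure(f))       # a consistent estimate of P f
print(empirical_process(f, 0.5))  # an O_P(1) fluctuation
```

The law of large numbers drives $\mathbb{P}_nf$ toward $Pf$, while the central limit theorem keeps $\mathbb{G}_nf$ of order one; the maximal inequalities of this paper control the supremum of the second quantity over a whole class $\mathcal{F}$.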

It is often useful to study the suprema of these stochastic processes over a given class $\mathcal{F}$ of measurable functions. The distribution of the supremum

$$\|\mathbb{G}_n\|_{\mathcal{F}}:=\sup_{f\in\mathcal{F}}|\mathbb{G}_nf|$$

is known to concentrate near its mean value, at a rate depending on the size of the envelope function of the class $\mathcal{F}$, but irrespective of its complexity. On the other hand, the mean value of $\|\mathbb{G}_n\|_{\mathcal{F}}$ depends on the size of the class $\mathcal{F}$. Entropy integrals, of which there are two basic versions, are useful tools to bound this mean value.

The uniform entropy integral was introduced in [9] and [5], following [3], in their study of the abstract version of Donsker’s theorem. We define an $L_r$-version of it as

$$J(\delta,\mathcal{F},L_r)=\sup_Q\int_0^\delta\sqrt{1+\log N\bigl(\varepsilon\|F\|_{Q,r},\mathcal{F},L_r(Q)\bigr)}\,d\varepsilon.$$

Here the supremum is taken over all finitely discrete probability distributions $Q$ on $(\mathcal{X},\mathcal{A})$, the covering number $N(\varepsilon,\mathcal{F},L_r(Q))$ is the minimal number of balls of radius $\varepsilon$ in $L_r(Q)$ needed to cover $\mathcal{F}$, $F$ is an envelope function of $\mathcal{F}$, and $\|f\|_{Q,r}$ denotes the norm of a function $f$ in $L_r(Q)$. The integral is defined relative to an envelope function, which need not be the minimal one, but can be any measurable function $F:\mathcal{X}\to\mathbb{R}$ such that $|f|\le F$ for every $f\in\mathcal{F}$. If multiple envelope functions are under consideration, then we write $J(\delta,\mathcal{F},F,L_r)$ to stress this dependence. An inequality, due to Pollard (also see [12], 2.14.1), says, under some measurability assumptions, that

$$\mathrm{E}_P\|\mathbb{G}_n\|_{\mathcal{F}}\lesssim J(1,\mathcal{F},L_2)\,\|F\|_{P,2}. \tag{1.1}$$

Here $\lesssim$ means smaller than up to a universal constant. This shows that for a class $\mathcal{F}$ with finite uniform entropy integral, the supremum $\|\mathbb{G}_n\|_{\mathcal{F}}$ is not essentially bigger than a multiple of the empirical process $\mathbb{G}_nF$ at the envelope function $F$. The inequality is particularly useful if this envelope function is small.
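To get a feel for the size of the uniform entropy integral, the following sketch (not part of the paper) evaluates $J(\delta,\mathcal{F},L_2)$ numerically under an assumed polynomial, VC-type covering bound $N(\varepsilon\|F\|_{Q,2},\mathcal{F},L_2(Q))\le(A/\varepsilon)^V$; the names `J`, `V` and `A` are illustrative.

```python
import math

def J(delta, V=2.0, A=2.0, m=20000):
    """Midpoint-rule approximation of the uniform entropy integral
    J(delta) = int_0^delta sqrt(1 + log N(eps)) d(eps), under the
    assumed polynomial covering bound N(eps) = (A / eps)**V."""
    h = delta / m
    return sum(math.sqrt(1.0 + V * math.log(A / ((i + 0.5) * h))) * h
               for i in range(m))
```

For small $\delta$ this behaves like $\delta\sqrt{V\log(A/\delta)}$: the integral is only slightly bigger than its upper limit, which is why bounds expressed in $J(\delta,\mathcal{F},L_2)$ rather than $J(1,\mathcal{F},L_2)$ are much sharper for classes of small functions.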

The bracketing entropy integral has its roots in the Donsker theorem of [8], again following initial work by Dudley. For a given norm $\|\cdot\|$ it can be defined as

$$J_{[\,]}(\delta,\mathcal{F},\|\cdot\|)=\int_0^\delta\sqrt{1+\log N_{[\,]}\bigl(\varepsilon\|F\|,\mathcal{F},\|\cdot\|\bigr)}\,d\varepsilon.$$

Here the bracketing number $N_{[\,]}(\varepsilon,\mathcal{F},\|\cdot\|)$ is the minimal number of brackets $[l,u]=\{f:\mathcal{X}\to\mathbb{R}: l\le f\le u\}$ of size $\|u-l\|$ smaller than $\varepsilon$ needed to cover $\mathcal{F}$. A useful inequality, due to Pollard (also see [12], 2.14.2), is

$$\mathrm{E}_P\|\mathbb{G}_n\|_{\mathcal{F}}\lesssim J_{[\,]}\bigl(1,\mathcal{F},L_2(P)\bigr)\,\|F\|_{P,2}. \tag{1.2}$$

Bracketing numbers are bigger than covering numbers (at twice the size), and hence the bracketing integral is bigger than a multiple of the corresponding entropy integral. However, the bracketing integral involves only the single distribution P, whereas the uniform entropy integral takes a supremum over all (discrete) distributions, making the two integrals incomparable in general. Apart from this difference the two maximal inequalities have the same message.

The two inequalities (1.1) and (1.2) involve the size of the envelope function, but not the sizes of the individual functions in the class $\mathcal{F}$. They also exploit only finiteness of the entropy integrals, roughly requiring that the entropy grows at smaller order than $\varepsilon^{-2}$ as $\varepsilon\downarrow0$, and not the precise size of the entropy. In the case of the bracketing integral this is remedied in the inequality (see [12], 3.4.2), valid for any class of functions $f:\mathcal{X}\to[-1,1]$ with $Pf^2\le\delta^2PF^2$ and any $\delta\in(0,1)$,

$$\mathrm{E}_P\|\mathbb{G}_n\|_{\mathcal{F}}\lesssim J_{[\,]}\bigl(\delta,\mathcal{F},L_2(P)\bigr)\,\|F\|_{P,2}\biggl(1+\frac{J_{[\,]}\bigl(\delta,\mathcal{F},L_2(P)\bigr)}{\delta^2\sqrt n\,\|F\|_{P,2}}\biggr). \tag{1.3}$$

Here the assumption that the class of functions is uniformly bounded is too restrictive for some applications, but it can be removed if the entropy integral is computed relative to the stronger “norm”

$$\|f\|_{P,B}=\Bigl(2P\bigl(e^{|f|}-1-|f|\bigr)\Bigr)^{1/2}.$$

Although it is not a norm, this quantity can be used to define the size of brackets and hence bracketing numbers. Inequality (1.3) is valid for an arbitrary class of functions with $\|f\|_{P,B}\le\delta\|F\|_{P,B}$ if the $L_2(P)$-norm is replaced by $\|\cdot\|_{P,B}$ in its right side (at four appearances) (see Theorem 3.4.3 of [12]). The “norm” $\|\cdot\|_{P,B}$ derives from the refined version of Bernstein’s inequality, and was first used in the literature on rates of convergence of minimum contrast estimators in [1] (also see [11]).

Maximal inequalities of type (1.3) using uniform entropy are thus far unavailable. In this note we derive an exact parallel of (1.3) for uniformly bounded functions, and investigate similar inequalities for unbounded functions. The validity of these results seems unexpected, as the stronger control given by bracketing has often been thought necessary for estimates of moduli of continuity. It was suggested to us by Theorem 3.1 and its proof in [4].

1.1. Application to minimum contrast estimators

Inequalities involving the sizes of the functions $f$ are of particular interest in the investigation of empirical minimum contrast estimators. Suppose that $\hat\theta_n$ minimizes a criterion of the type

$$\theta\mapsto\mathbb{P}_nm_\theta,$$

for given measurable functions $m_\theta:\mathcal{X}\to\mathbb{R}$ indexed by a parameter $\theta$, and that the population contrast satisfies, for a “true” parameter $\theta_0$ and some metric $d$ on the parameter set,

$$Pm_\theta-Pm_{\theta_0}\gtrsim d^2(\theta,\theta_0).$$

A bound on the rate of convergence of $\hat\theta_n$ to $\theta_0$ can then be derived from the modulus of continuity of the empirical process $\mathbb{G}_nm_\theta$ indexed by the functions $m_\theta$. Specifically (see e.g. [12], 3.2.5), if $\phi_n$ is a function such that $\delta\mapsto\phi_n(\delta)/\delta^\alpha$ is decreasing for some $\alpha<2$ and

$$\mathrm{E}\sup_{\theta:\,d(\theta,\theta_0)<\delta}\bigl|\mathbb{G}_n(m_\theta-m_{\theta_0})\bigr|\lesssim\phi_n(\delta), \tag{1.4}$$

then $d(\hat\theta_n,\theta_0)=O_P(\delta_n)$, for $\delta_n$ any solution to

$$\phi_n(\delta_n)\le\sqrt n\,\delta_n^2. \tag{1.5}$$

Inequality (1.4) involves the empirical process indexed by the class of functions $\mathcal{M}_\delta=\{m_\theta-m_{\theta_0}: d(\theta,\theta_0)<\delta\}$. If $d$ dominates the $L_2(P)$-norm, or another norm $\|\cdot\|$ that can be used in an inequality of the type (1.3), such as the Bernstein norm, and the norms of the envelopes of the classes $\mathcal{M}_\delta$ are bounded in $\delta$, then we can choose

$$\phi_n(\delta)=J\bigl(\delta,\mathcal{M}_\delta,\|\cdot\|\bigr)\biggl(1+\frac{J\bigl(\delta,\mathcal{M}_\delta,\|\cdot\|\bigr)}{\delta^2\sqrt n}\biggr),$$

where J is an appropriate entropy integral. For this choice the inequality (1.5) is equivalent to

$$J\bigl(\delta_n,\mathcal{M}_{\delta_n},\|\cdot\|\bigr)\le\sqrt n\,\delta_n^2. \tag{1.6}$$

Thus a rate of convergence can be read off directly from the entropy integral.
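As an illustration of this recipe (a sketch under assumptions, not a computation from the paper), suppose the entropy integral of the classes $\mathcal{M}_\delta$ behaves like $\delta\sqrt{V\log(A/\delta)}$, as for a hypothetical VC-type class; then (1.6) can be solved for $\delta_n$ numerically.

```python
import math

def entropy_integral(delta, V=2.0, A=2.0):
    """Assumed VC-type behaviour of J(delta, M_delta, ||.||)."""
    return delta * math.sqrt(max(V * math.log(A / delta), 0.0))

def rate(n, J=entropy_integral):
    """Solve J(delta) = sqrt(n) * delta**2 for delta, cf. (1.6).
    J(delta)/delta**2 decreases in delta while sqrt(n) is constant,
    so the crossing point is unique and bisection applies."""
    lo, hi = 1e-12, 1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if J(mid) > math.sqrt(n) * mid ** 2:
            lo = mid  # entropy still dominates: the solution is larger
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For this choice the solution shrinks like $\sqrt{\log n/n}$, the familiar parametric rate up to a logarithmic factor.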

We note that an inequality of type (1.3) is unattractive for very small δ, as the bound may even increase to infinity as δ ↓ 0. However, it is accurate for the range of δ that are important in the application to moduli of continuity.

Moduli of continuity also play an important role in model selection theorems. See for instance [7].

Inequalities involving uniform entropy permit for instance the immediate derivation of rates of convergence for minimum contrast functions that form VC-classes. Furthermore, uniform entropy is preserved under various (combinatorial) operations to make new classes of functions. This makes uniform entropy integrals a useful tool in situations where bracketing numbers may be difficult to handle. Equation (1.6) gives an elegant characterization of rates of convergence in these situations, where thus far ad-hoc arguments were necessary.

2. Uniformly bounded classes

Call the class $\mathcal{F}$ of functions $P$-measurable if the map

$$(X_1,\dots,X_n)\mapsto\sup_{f\in\mathcal{F}}\Bigl|\sum_{i=1}^ne_if(X_i)\Bigr|$$

on the completion of the probability space $(\mathcal{X}^n,\mathcal{A}^n,P^n)$ is measurable, for every sequence $e_1,e_2,\dots,e_n\in\{-1,1\}$.

Theorem 2.1. Let $\mathcal{F}$ be a $P$-measurable class of measurable functions with envelope function $F\le1$ and such that $\mathcal{F}^2$ is $P$-measurable. If $Pf^2<\delta^2PF^2$ for every $f$ and some $\delta\in(0,1)$, then

$$\mathrm{E}_P\|\mathbb{G}_n\|_{\mathcal{F}}\lesssim J(\delta,\mathcal{F},L_2)\biggl(1+\frac{J(\delta,\mathcal{F},L_2)}{\delta^2\sqrt n\,\|F\|_{P,2}}\biggr)\|F\|_{P,2}.$$

Proof. We use the following refinement of (1.1) (see e.g. [12], 2.14.1): for any $P$-measurable class $\mathcal{F}$,

$$\mathrm{E}_P\|\mathbb{G}_n\|_{\mathcal{F}}\lesssim \mathrm{E}_P\,J\biggl(\frac{\sup_f(\mathbb{P}_nf^2)^{1/2}}{(\mathbb{P}_nF^2)^{1/2}},\mathcal{F},L_2\biggr)\bigl(\mathbb{P}_nF^2\bigr)^{1/2}. \tag{2.1}$$

Because $\delta\mapsto J(\delta,\mathcal{F},L_2)$ is the integral of a nonincreasing nonnegative function, it is a concave function such that the map $t\mapsto J(t)/t$, which is the average of its derivative over $[0,t]$, is nonincreasing. The concavity shows that its perspective $(x,t)\mapsto tJ(x/t,\mathcal{F},L_2)$ is a concave function of its two arguments (cf. [2], page 89). Furthermore, the “extended-value extension” of this function (which by definition is $-\infty$ if $x\le0$ or $t\le0$) is obviously nondecreasing in its first argument and was noted to be nondecreasing in its second argument. Therefore, by the vector composition rules for concave functions ([2], pages 83–87, especially lines -2 and -1 of page 86), the function $(x,y)\mapsto H(x,y):=J\bigl(\sqrt x/\sqrt y,\mathcal{F},L_2\bigr)\sqrt y$ is concave. We have that $\mathrm{E}_P\mathbb{P}_nF^2=\|F\|_{P,2}^2$. Therefore, by an application of Jensen’s inequality to the right side of the preceding display we obtain, for $\sigma_n^2=\sup_f\mathbb{P}_nf^2$,

$$\mathrm{E}_P\|\mathbb{G}_n\|_{\mathcal{F}}\lesssim J\biggl(\frac{\sqrt{\mathrm{E}_P\sigma_n^2}}{\|F\|_{P,2}},\mathcal{F},L_2\biggr)\|F\|_{P,2}. \tag{2.2}$$

The application of Jensen’s inequality with outer expectations can be justified here by the monotonicity of the function H, which shows that the measurable majorant of a variable H(U, V) is bounded above by H(U*, V*), for U* and V* measurable majorants of U and V. Thus E*H(U, V) ≤ EH(U*, V*), after which Jensen’s inequality can be applied in its usual (measurable) form.

The second step of the proof is to bound $\mathrm{E}_P\sigma_n^2$. Because $\mathbb{P}_nf^2=Pf^2+n^{-1/2}\mathbb{G}_nf^2$ and $Pf^2\le\delta^2PF^2$ for every $f$, we have

$$\mathrm{E}_P\sigma_n^2\le\delta^2\|F\|_{P,2}^2+\frac{1}{\sqrt n}\,\mathrm{E}_P\|\mathbb{G}_n\|_{\mathcal{F}^2}. \tag{2.3}$$

Here the empirical process in the second term can be replaced by the symmetrized empirical process $\mathbb{G}_n^o$ (defined as $\mathbb{G}_n^of=n^{-1/2}\sum_{i=1}^n\varepsilon_if(X_i)$ for independent Rademacher variables $\varepsilon_1,\varepsilon_2,\dots,\varepsilon_n$) at the cost of adding a multiplicative factor 2 (e.g. [12], 2.3.1). The expectation can be factorized as the expectation on the Rademacher variables $\varepsilon$ followed by the expectation on $X_1,\dots,X_n$, and $\mathrm{E}_\varepsilon\|\mathbb{G}_n^o\|_{\mathcal{F}^2}\le2\,\mathrm{E}_\varepsilon\|\mathbb{G}_n^o\|_{\mathcal{F}}$ by the contraction principle for Rademacher variables ([6], Theorem 4.12) and the fact that $F\le1$ by assumption. Taking the expectation on $X_1,\dots,X_n$, we obtain that $\mathrm{E}_P\|\mathbb{G}_n\|_{\mathcal{F}^2}\le4\,\mathrm{E}_P\|\mathbb{G}_n^o\|_{\mathcal{F}}$, which in turn is bounded above by $8\,\mathrm{E}_P\|\mathbb{G}_n\|_{\mathcal{F}}$ by the desymmetrization inequality (e.g. 2.3.6 in [12]).

Thus $\mathcal{F}^2$ in the last term of (2.3) can be replaced by $\mathcal{F}$, at the cost of inserting a constant. Next we apply (2.2) to this term, and conclude that $z^2:=\mathrm{E}_P\sigma_n^2/\|F\|_{P,2}^2$ satisfies the inequality

$$z^2\lesssim\delta^2+\frac{J(z,\mathcal{F},L_2)}{\sqrt n\,\|F\|_{P,2}}. \tag{2.4}$$

We apply Lemma 2.1 with $r=1$, $A=\delta$ and $B^2=1/(\sqrt n\,\|F\|_{P,2})$ to see that

$$J(z,\mathcal{F},L_2)\lesssim J(\delta,\mathcal{F},L_2)+\frac{J^2(\delta,\mathcal{F},L_2)}{\delta^2\sqrt n\,\|F\|_{P,2}}.$$

We insert this in (2.2) to complete the proof.

Lemma 2.1. Let $J:(0,\infty)\to\mathbb{R}$ be a concave, nondecreasing function with $J(0)=0$. If $z^2\le A^2+B^2J(z^r)$ for some $r\in(0,2)$ and $A,B>0$, then

$$J(z)\lesssim J(A)\Bigl[1+J(A^r)\Bigl(\frac BA\Bigr)^2\Bigr]^{1/(2-r)}.$$

Proof. For $t>s>0$ we can write $s$ as the convex combination $s=(s/t)t+(1-s/t)0$ of $t$ and 0. Since $J(0)=0$, the concavity of $J$ gives that $J(s)\ge(s/t)J(t)$. Thus the function $t\mapsto J(t)/t$ is decreasing, which implies that $J(Ct)\le CJ(t)$ for $C\ge1$ and any $t>0$.

By the monotonicity of J and the assumption on z it follows that

$$J(z^r)\le J\bigl((A^2+B^2J(z^r))^{r/2}\bigr)\le J(A^r)\Bigl(1+\Bigl(\frac BA\Bigr)^2J(z^r)\Bigr)^{r/2}.$$

This implies that $J(z^r)$ is bounded by a multiple of the maximum of $J(A^r)$ and $J(A^r)(B/A)^rJ(z^r)^{r/2}$. If it is bounded by the second one, then $J(z^r)^{1-r/2}\lesssim J(A^r)(B/A)^r$. We conclude that

$$J(z^r)\lesssim J(A^r)+J(A^r)^{2/(2-r)}\Bigl(\frac BA\Bigr)^{2r/(2-r)}.$$

Next again by the monotonicity of J,

$$J(z)\le J\Bigl(\sqrt{A^2+B^2J(z^r)}\Bigr)\le J(A)\sqrt{1+\Bigl(\frac BA\Bigr)^2J(z^r)}\lesssim J(A)\Bigl[1+\Bigl(\frac BA\Bigr)^2\Bigl(J(A^r)+J(A^r)^{2/(2-r)}\Bigl(\frac BA\Bigr)^{2r/(2-r)}\Bigr)\Bigr]^{1/2}\le J(A)\Bigl[1+\sqrt{J(A^r)}\,\frac BA+\Bigl(J(A^r)\Bigl(\frac BA\Bigr)^2\Bigr)^{1/(2-r)}\Bigr].$$

The middle term on the right side is bounded by a multiple of the sum of the first and third terms, since $x\le1^p+x^q$ for any conjugate pair $(p,q)$ and any $x>0$, in particular for $x=\sqrt{J(A^r)}\,B/A$ and $q=2/(2-r)$.

For values of $\delta$ such that $\delta\|F\|_{P,2}\le1/\sqrt n$ Theorem 2.1 can be improved. (This seems not to be of prime interest for statistical applications.) Its bound can be written in the form $J(\delta,\mathcal{F},L_2)\|F\|_{P,2}+J^2(\delta,\mathcal{F},L_2)/(\delta^2\sqrt n)$. In the second term $\delta$ can be replaced by $1/(\|F\|_{P,2}\sqrt n)$, which is better if $\delta$ is smaller than the latter number, as the function $\delta\mapsto J(\delta,\mathcal{F},L_2)/\delta$ is decreasing.

Lemma 2.2. Under the conditions of Theorem 2.1,

$$\mathrm{E}_P\|\mathbb{G}_n\|_{\mathcal{F}}\lesssim J(\delta,\mathcal{F},L_2)\,\|F\|_{P,2}+J^2\Bigl(\frac{1}{\sqrt n\,\|F\|_{P,2}},\mathcal{F},L_2\Bigr)\sqrt n\,\|F\|_{P,2}^2.$$

Proof. We follow the proof of Theorem 2.1 up to (2.4), but next use the alternative bounds

$$J(z,\mathcal{F},L_2)\le J\biggl(\sqrt{\delta^2+\frac{J(z,\mathcal{F},L_2)}{\sqrt n\,\|F\|_{P,2}}},\mathcal{F},L_2\biggr)\le J(\delta,\mathcal{F},L_2)+J\biggl(\sqrt{\frac{J(z,\mathcal{F},L_2)}{\sqrt n\,\|F\|_{P,2}}},\mathcal{F},L_2\biggr)\le J(\delta,\mathcal{F},L_2)+J(\delta_n,\mathcal{F},L_2)\biggl(\sqrt{\frac{J(z,\mathcal{F},L_2)}{\delta_n}}+1\biggr),$$

for $1/\delta_n=\sqrt n\,\|F\|_{P,2}$. Here we have used the subadditivity of the map $\delta\mapsto J(\delta,\mathcal{F},L_2)$, and the inequality $J(C\delta,\mathcal{F},L_2)\le CJ(\delta,\mathcal{F},L_2)$ for $C\ge1$ in the last step. We can bound the sum of the three terms on the right side by a multiple of the maximum of these terms and conclude that the left side is smaller than a multiple of at least one of the three terms. Solving next yields that

$$J(z,\mathcal{F},L_2)\lesssim J(\delta,\mathcal{F},L_2)\vee\frac{J^2(\delta_n,\mathcal{F},L_2)}{\delta_n}\vee J(\delta_n,\mathcal{F},L_2).$$

Because $J(\delta_n,\mathcal{F},L_2)\ge\delta_n$ for every $\delta_n>0$, by the definition of the entropy integral, the third term on the right is bounded by the second term. We substitute the resulting bound in (2.2) to finish the proof.

3. Unbounded classes

In this section we investigate relaxations of the assumption that the class F of functions is uniformly bounded, made in Theorem 2.1. We start with a moment bound on the envelope.

Theorem 3.1. Let $\mathcal{F}$ be a $P$-measurable class of measurable functions with envelope function $F$ such that $PF^{(4p-2)/(p-1)}<\infty$ for some $p>1$ and such that $\mathcal{F}^2$ and $\mathcal{F}^4$ are $P$-measurable. If $Pf^2<\delta^2PF^2$ for every $f$ and some $\delta\in(0,1)$, then

$$\mathrm{E}_P\|\mathbb{G}_n\|_{\mathcal{F}}\lesssim J(\delta,\mathcal{F},L_2)\Biggl(1+\frac{J(\delta^{1/p},\mathcal{F},L_2)}{\delta^2\sqrt n}\,\frac{\|F\|_{P,(4p-2)/(p-1)}^{2-1/p}}{\|F\|_{P,2}^{2-1/p}}\Biggr)^{p/(2p-1)}\|F\|_{P,2}.$$

Proof. Application of (2.1) to the functions $f^2$, forming the class $\mathcal{F}^2$ with envelope function $F^2$, yields

$$\mathrm{E}_P\|\mathbb{G}_n\|_{\mathcal{F}^2}\lesssim \mathrm{E}_P\,J\biggl(\frac{\sigma_{n,4}^2}{(\mathbb{P}_nF^4)^{1/2}},\mathcal{F}^2,F^2,L_2\biggr)\bigl(\mathbb{P}_nF^4\bigr)^{1/2}, \tag{3.1}$$

for $\sigma_{n,r}$ the diameter of $\mathcal{F}$ in $L_r(\mathbb{P}_n)$, i.e.

$$\sigma_{n,r}^r=\sup_f\mathbb{P}_nf^r. \tag{3.2}$$

Preservation properties of uniform entropy (see [10], or [12], 2.10.20, where the supremum over $Q$ can also be moved outside the integral to match our current definition of the entropy integral, applied to $\phi(f)=f^2$ with $L=2F$) show that $J(\delta,\mathcal{F}^2,F^2,L_2)\lesssim J(\delta,\mathcal{F},F,L_2)$, for every $\delta>0$. Because $\mathbb{P}_nf^2=Pf^2+n^{-1/2}\mathbb{G}_nf^2$ and $Pf^2\le\delta^2PF^2$ by assumption, we find that

$$\mathrm{E}_P\sigma_{n,2}^2\lesssim\delta^2PF^2+\frac{1}{\sqrt n}\,\mathrm{E}_P\,J\biggl(\frac{\sigma_{n,4}^2}{(\mathbb{P}_nF^4)^{1/2}},\mathcal{F},L_2\biggr)\bigl(\mathbb{P}_nF^4\bigr)^{1/2}. \tag{3.3}$$

The next step is to bound σn,4 in terms of σn,2.

By Hölder’s inequality, for any conjugate pair $(p,q)$ and any $0<s<4$,

$$\mathbb{P}_nf^4\le\mathbb{P}_n|f|^{4-s}F^s\le\bigl(\mathbb{P}_n|f|^{(4-s)p}\bigr)^{1/p}\bigl(\mathbb{P}_nF^{sq}\bigr)^{1/q}.$$

Choosing s such that (4 − s)p = 2 (and hence sq = (4p − 2)/(p − 1)), we find that

$$\sigma_{n,4}^4\le\sigma_{n,2}^{2/p}\bigl(\mathbb{P}_nF^{sq}\bigr)^{1/q}.$$

We insert this bound in (3.3). The function $(x,y)\mapsto x^{1/p}y^{1/q}$ is concave, and hence the function $(x,y,z)\mapsto J\bigl(x^{1/(2p)}y^{1/(2q)}/\sqrt z,\mathcal{F},L_2\bigr)\sqrt z$ can be seen to be concave by the same arguments as in the proof of Theorem 2.1. Therefore, we can apply Jensen’s inequality to see that

$$\mathrm{E}_P\sigma_{n,2}^2\lesssim\delta^2PF^2+\frac{1}{\sqrt n}\,J\biggl(\frac{(\mathrm{E}_P\sigma_{n,2}^2)^{1/(2p)}(PF^{sq})^{1/(2q)}}{(PF^4)^{1/2}},\mathcal{F},L_2\biggr)\bigl(PF^4\bigr)^{1/2}.$$

We conclude that $z:=(\mathrm{E}_P\sigma_{n,2}^2)^{1/2}/\|F\|_{P,2}$ satisfies

$$z^2\lesssim\delta^2+\frac1{\sqrt n}\,J\biggl(\frac{z^{1/p}(PF^2)^{1/(2p)}(PF^{sq})^{1/(2q)}}{(PF^4)^{1/2}},\mathcal{F},L_2\biggr)\frac{(PF^4)^{1/2}}{PF^2}\lesssim\delta^2+J\bigl(z^{1/p},\mathcal{F},L_2\bigr)\,\frac{(PF^{sq})^{1/(2q)}}{\sqrt n\,(PF^2)^{1-1/(2p)}}.$$

In the last step we use that $J(C\delta,\mathcal{F},L_2)\le CJ(\delta,\mathcal{F},L_2)$ for $C\ge1$, and Hölder’s inequality as previously to see that the present $C$ satisfies this condition. We next apply Lemma 2.1 (with $r=1/p$) to obtain a bound on $J(z,\mathcal{F},L_2)$, and conclude the proof by substituting this bound in (2.2).

The preceding theorem assumes only a finite moment of the envelope function, but in comparison to Theorem 2.1 substitutes $J(\delta^{1/p},\mathcal{F},L_2)$ in the correction term of the upper bound, where $p>1$ and hence $\delta^{1/p}\ge\delta$ for small $\delta$. In applications to moduli of continuity of minimum contrast criteria this is sufficient to obtain consistency with a rate, but typically the rate will be suboptimal. The rate improves as $p\downarrow1$, which requires finite moments of the envelope function of order increasing to infinity, the limiting case $p=1$ corresponding to a bounded envelope, as in Theorem 2.1. The following theorem interpolates between finite moments of any order and a bounded envelope function. If applied to obtaining rates of convergence it gives rates that are optimal up to a logarithmic factor.

Theorem 3.2. Let $\mathcal{F}$ be a $P$-measurable class of measurable functions with envelope function $F$ such that $P\exp(F^{p+\rho})<\infty$ for some $p,\rho>0$ and such that $\mathcal{F}^2$ and $\mathcal{F}^4$ are $P$-measurable. If $Pf^2<\delta^2PF^2$ for every $f$ and some $\delta\in(0,1/2)$, then, for a constant $c$ depending on $p$, $PF^2$, $PF^4$ and $P\exp(F^{p+\rho})$,

$$\mathrm{E}_P\|\mathbb{G}_n\|_{\mathcal{F}}\le c\,J(\delta,\mathcal{F},L_2)\Biggl(1+\frac{J\bigl(\delta(\log(1/\delta))^{1/p},\mathcal{F},L_2\bigr)}{\delta^2\sqrt n}\Biggr).$$

Proof. Fix $r=2/p$. The functions $\psi,\bar\psi:[0,\infty)\to[0,\infty)$ defined by

$$\psi(f)=\log^r(1+f),\qquad\bar\psi(f)=e^{f^{1/r}}-1,$$

are each other’s inverses, and are increasing from $\psi(0)=\bar\psi(0)=0$ to infinity. Thus their primitive functions $\Psi(f)=\int_0^f\psi(s)\,ds$ and $\bar\Psi(f)=\int_0^f\bar\psi(s)\,ds$ satisfy Young’s inequality $fg\le\Psi(f)+\bar\Psi(g)$, for every $f,g\ge0$ (e.g. [2], page 120, 3.38).

The function $t\mapsto t\log^r(1/t)$ is concave in a neighbourhood of 0 (specifically: on the interval $(0,e^{1-r}\wedge1)$), with limit from the right equal to 0 at 0, and derivative tending to infinity at this point. Therefore, there exists a concave, increasing function $k:(0,\infty)\to(0,\infty)$ that is identical to $t\mapsto t\log^r(1/t)$ near 0 and bounded below and above by a positive constant times the identity on the remainder of its domain. (E.g. extend $t\mapsto t\log^r(1/t)$ linearly with slope 1 from the point where the derivative of the latter function has decreased to 1.) Write $k(t)=t\,l^r(t)$, so that $l^r$ is bounded below by a constant and $l(t)=\log(1/t)$ near 0. Then, for every $t>0$,

$$\log\Bigl(2+\frac tC\Bigr)\lesssim l(C)\log(2+t). \tag{3.4}$$

(The constant in $\lesssim$ may depend on $r$.) To see this, note that for $C>c$ the left side is bounded by a multiple of $\log(2+t/c)$, whereas for small $C$ the left side is bounded by a multiple of $\log(2+t)+\log(1+1/C)\lesssim l(C)\log(2+t)+1$.

From the inequality $\Psi(f)\le f\psi(f)$, we obtain that, for $f>0$,

$$\Psi\Bigl(\frac f{\log^r(2+f)}\Bigr)\lesssim f.$$

Therefore, by (3.4) followed by Young’s inequality,

$$\frac{f^4}{k(C^2)}=\frac{f^2/C^2}{\log^r(2+f^2/C^2)}\cdot\frac{f^2\log^r(2+f^2/C^2)}{l^r(C^2)}\lesssim\frac{f^2/C^2}{\log^r(2+f^2/C^2)}\cdot f^2\log^r(2+f^2)\lesssim\frac{f^2}{C^2}+\bar\Psi\bigl(F^2\log^r(2+F^2)\bigr).$$

On integrating this with respect to the empirical measure, with $C^2=\mathbb{P}_nf^2$, we see that, with $G=\bar\Psi\bigl(F^2\log^r(2+F^2)\bigr)$,

$$\mathbb{P}_nf^4\lesssim k(\mathbb{P}_nf^2)\bigl(1+\mathbb{P}_nG\bigr).$$

We take the supremum over $f$ to bound $\sigma_{n,4}^4$ as in (3.2) in terms of $k(\sigma_{n,2}^2)$, and next substitute this bound in (3.3) to find that

$$\mathrm{E}_P\sigma_{n,2}^2\lesssim\delta^2PF^2+\frac1{\sqrt n}\,\mathrm{E}_P\,J\biggl(\frac{\sqrt{k(\sigma_{n,2}^2)}\sqrt{1+\mathbb{P}_nG}}{(\mathbb{P}_nF^4)^{1/2}},\mathcal{F},L_2\biggr)\bigl(\mathbb{P}_nF^4\bigr)^{1/2}\lesssim\delta^2PF^2+\frac1{\sqrt n}\,J\biggl(\frac{\sqrt{k(\mathrm{E}_P\sigma_{n,2}^2)}\sqrt{1+PG}}{(PF^4)^{1/2}},\mathcal{F},L_2\biggr)\bigl(PF^4\bigr)^{1/2},$$

where we have used the concavity of $k$, and the concavity of the other maps, as previously. By assumption the expected value $PG$ is finite for $r=2/p$. It follows that $z^2=\mathrm{E}_P\sigma_{n,2}^2/PF^2$ satisfies, for suitable constants $a,b,c$ depending on $r$, $PF^2$, $PF^4$ and $PG$,

$$z^2\le\delta^2+\frac a{\sqrt n}\,J\bigl(\sqrt{k(z^2b)}\,c,\mathcal{F},L_2\bigr).$$

By concavity and the fact that $k(0)=0$, we have $k(Cz)\le Ck(z)$ for $C\ge1$ and $z>0$. The function $z\mapsto\sqrt{k(z^2b)}\,c$ inherits this property. Therefore we can apply Lemma 3.1, with $k$ of the lemma equal to the present function $z\mapsto\sqrt{k(z^2b)}\,c$, to obtain a bound on $J(z,\mathcal{F},L_2)$ in terms of $J(\delta,\mathcal{F},L_2)$ and $J\bigl(\sqrt{k(\delta^2b)}\,c,\mathcal{F},L_2\bigr)$, which we substitute in (2.2). Here $k(\delta^2)=\delta^2\log^r(1/\delta^2)\asymp\delta^2\log^r(1/\delta)$ for sufficiently small $\delta>0$, and $k(\delta^2)\asymp\delta^2\asymp\delta^2\log^r(1/\delta)$ for $\delta<1/2$ and bounded away from 0. Thus we can simplify the bound to the one in the statement of the theorem, possibly after increasing the constants $a,b,c$ to be at least 1, to complete the proof.

Lemma 3.1. Let $J:(0,\infty)\to\mathbb{R}$ be a concave, nondecreasing function with $J(0)=0$, and let $k:(0,\infty)\to(0,\infty)$ be nondecreasing and satisfy $k(Cz)\le Ck(z)$ for $C\ge1$ and $z>0$. If $z^2\le A^2+B^2J(k(z))$ for some $A,B>0$, then

$$J(z)\lesssim J(A)\Bigl[1+J(k(A))\Bigl(\frac BA\Bigr)^2\Bigr].$$

Proof. As noted in the proof of Lemma 2.1 the properties of $J$ imply that $J(Cz)\le CJ(z)$ for $C\ge1$ and any $z>0$. In view of the assumed property of $k$ and the monotonicity of $J$ it follows that $J\circ k(Cz)\le CJ\circ k(z)$ for every $C\ge1$ and $z>0$. Therefore, by the monotonicity of $J$ and $k$, and the assumption on $z$,

$$J\circ k(z)\le J\circ k\Bigl(\sqrt{A^2+B^2J\circ k(z)}\Bigr)\le J\circ k(A)\sqrt{1+\Bigl(\frac BA\Bigr)^2J\circ k(z)}.$$

As in the proof of Lemma 2.1 we can solve this for $J\circ k(z)$ to find that

$$J\circ k(z)\lesssim J\circ k(A)+J\circ k(A)^2\Bigl(\frac BA\Bigr)^2.$$

Next again by the monotonicity of $J$,

$$J(z)\le J\Bigl(\sqrt{A^2+B^2J\circ k(z)}\Bigr)\le J(A)\sqrt{1+\Bigl(\frac BA\Bigr)^2J\circ k(z)}\lesssim J(A)\Bigl[1+\frac BA\sqrt{J\circ k(A)}+\Bigl(\frac BA\Bigr)^2J\circ k(A)\Bigr].$$

The middle term on the right side is bounded by the sum of the first and third terms.

Footnotes

AMS 2000 subject classifications: 60K35.

Contributor Information

Aad van der Vaart, Department of Mathematics, Faculty of Sciences, Vrije Universiteit De Boelelaan 1081a, 1081 HV Amsterdam, aad@cs.vu.nl.

Jon A. Wellner, Department of Statistics, University of Washington, Seattle, WA 98195-4322, jaw@stat.washington.edu.

References

  • [1] Birgé L, Massart P. Rates of convergence for minimum contrast estimators. Probab. Theory Related Fields. 1993;97(1–2):113–150. MR1240719.
  • [2] Boyd S, Vandenberghe L. Convex Optimization. Cambridge University Press; Cambridge: 2004. MR2061575.
  • [3] Dudley RM. Central limit theorems for empirical measures. Ann. Probab. 1978;6(6):899–929. MR0512411.
  • [4] Giné E, Koltchinskii V. Concentration inequalities and asymptotic results for ratio type empirical processes. Ann. Probab. 2006;34(3):1143–1216. MR2243881.
  • [5] Kolchins’kiĭ VĪ. On the central limit theorem for empirical measures. Teor. Veroyatnost. i Mat. Statist. 1981;24:63–75. MR0628431.
  • [6] Ledoux M, Talagrand M. Probability in Banach Spaces: Isoperimetry and Processes. Vol. 23 of Ergebnisse der Mathematik und ihrer Grenzgebiete (3). Springer-Verlag; Berlin: 1991. MR1102015.
  • [7] Massart P, Nédélec É. Risk bounds for statistical learning. Ann. Statist. 2006;34(5):2326–2366. MR2291502.
  • [8] Ossiander M. A central limit theorem under metric entropy with L2 bracketing. Ann. Probab. 1987;15(3):897–919. MR0893905.
  • [9] Pollard D. A central limit theorem for empirical processes. J. Austral. Math. Soc. Ser. A. 1982;33(2):235–248. MR0668445.
  • [10] Pollard D. Empirical Processes: Theory and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics, 2. Institute of Mathematical Statistics; Hayward, CA: 1990. MR1089429.
  • [11] van de Geer S. The method of sieves and minimum contrast estimators. Math. Methods Statist. 1995;4(1):20–38. MR1324688.
  • [12] van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer-Verlag; New York: 1996. MR1385671.
