Abstract
We derive an upper bound for the mean of the supremum of the empirical process indexed by a class of functions that are known to have variance bounded by a small constant δ². The bound is expressed in terms of the uniform entropy integral of the class at δ. The bound yields a rate of convergence for minimum contrast estimators when applied to the modulus of continuity of the contrast functions.
Keywords: Empirical process, modulus of continuity, minimum contrast estimator, rate of convergence
1. Introduction
The empirical measure and empirical process of a sample of observations X1, … , Xn from a probability measure P on a measurable space $(\mathcal{X}, \mathcal{A})$ attach to a given measurable function $f\colon \mathcal{X} \to \mathbb{R}$ the numbers

$$\mathbb{P}_n f = \frac{1}{n}\sum_{i=1}^n f(X_i), \qquad \mathbb{G}_n f = \sqrt{n}\,\big(\mathbb{P}_n f - Pf\big).$$
It is often useful to study the suprema of these stochastic processes over a given class $\mathcal{F}$ of measurable functions. The distribution of the supremum

$$\|\mathbb{G}_n\|_{\mathcal{F}} = \sup_{f\in\mathcal{F}}\big|\mathbb{G}_n f\big|$$
is known to concentrate near its mean value, at a rate depending on the size of the envelope function of the class $\mathcal{F}$, but irrespective of its complexity. On the other hand, the mean value of $\|\mathbb{G}_n\|_{\mathcal{F}}$ depends on the size of the class $\mathcal{F}$. Entropy integrals, of which there are two basic versions, are useful tools to bound this mean value.
The uniform entropy integral was introduced in [9] and [5], following [3], in their study of the abstract version of Donsker’s theorem. We define an $L_r$-version of it as

$$J(\delta, \mathcal{F}, L_r) = \sup_Q \int_0^\delta \sqrt{1 + \log N\big(\varepsilon\|F\|_{Q,r},\, \mathcal{F},\, L_r(Q)\big)}\;d\varepsilon.$$
Here the supremum is taken over all finitely discrete probability distributions Q on $(\mathcal{X}, \mathcal{A})$, the covering number $N(\varepsilon, \mathcal{F}, L_r(Q))$ is the minimal number of balls of radius ε in $L_r(Q)$ needed to cover $\mathcal{F}$, F is an envelope function of $\mathcal{F}$, and $\|f\|_{Q,r}$ denotes the norm of a function f in $L_r(Q)$. The integral is defined relative to an envelope function, which need not be the minimal one, but can be any measurable function such that |f| ≤ F for every $f \in \mathcal{F}$. If multiple envelope functions are under consideration, then we write the envelope explicitly, as in $J(\delta, \mathcal{F}, F, L_r)$, to stress this dependence. An inequality, due to Pollard (also see [12], 2.14.1), says, under some measurability assumptions, that
$$E_P^*\|\mathbb{G}_n\|_{\mathcal{F}} \lesssim J(1, \mathcal{F}, L_2)\,\|F\|_{P,2}. \qquad (1.1)$$
Here ≲ means smaller than up to a universal constant. This shows that for a class $\mathcal{F}$ with finite uniform entropy integral, the supremum $\|\mathbb{G}_n\|_{\mathcal{F}}$ is not essentially bigger than a multiple of the empirical process at the envelope function F. The inequality is particularly useful if this envelope function is small.
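For orientation, consider a VC-class $\mathcal{F}$ of index V (an illustrative special case, not needed in what follows). Its covering numbers satisfy a polynomial bound, $\sup_Q \log N(\varepsilon\|F\|_{Q,2}, \mathcal{F}, L_2(Q)) \lesssim V\log(1/\varepsilon)$ ([12], Theorem 2.6.7), so that

$$J(\delta, \mathcal{F}, L_2) \lesssim \int_0^\delta \sqrt{1 + V\log(1/\varepsilon)}\;d\varepsilon \lesssim \delta\,\sqrt{V\log(1/\delta)}, \qquad \delta \le 1/2.$$

In particular $J(1, \mathcal{F}, L_2)$ is finite, and (1.1) applies with a dimension-type constant of the order $\sqrt{V}$.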
The bracketing entropy integral has its roots in the Donsker theorem of [8], again following initial work by Dudley. For a given norm ‖ ⋅ ‖ it can be defined as

$$J_{[\,]}\big(\delta, \mathcal{F}, \|\cdot\|\big) = \int_0^\delta \sqrt{1 + \log N_{[\,]}\big(\varepsilon\|F\|,\, \mathcal{F},\, \|\cdot\|\big)}\;d\varepsilon.$$
Here the bracketing number $N_{[\,]}(\varepsilon, \mathcal{F}, \|\cdot\|)$ is the minimal number of brackets of size smaller than ε needed to cover $\mathcal{F}$. A useful inequality, due to Pollard (also see [12], 2.14.2), is
$$E_P^*\|\mathbb{G}_n\|_{\mathcal{F}} \lesssim J_{[\,]}\big(1, \mathcal{F}, L_2(P)\big)\,\|F\|_{P,2}. \qquad (1.2)$$
Bracketing numbers are bigger than covering numbers (at twice the size), and hence the bracketing integral is bigger than a multiple of the corresponding entropy integral. However, the bracketing integral involves only the single distribution P, whereas the uniform entropy integral takes a supremum over all (discrete) distributions, making the two integrals incomparable in general. Apart from this difference the two maximal inequalities have the same message.
The two inequalities (1.1) and (1.2) involve the size of the envelope function, but not the sizes of the individual functions in the class $\mathcal{F}$. They also exploit only the finiteness of the entropy integrals, roughly requiring that the entropy grows at smaller order than ε^{−2} as ε ↓ 0, and not the precise size of the entropy. In the case of the bracketing integral this is remedied by the inequality (see [12], 3.4.2), valid for any class of functions with envelope F ≤ 1 such that Pf² ≤ δ²PF² for every f, and any δ ∈ (0, 1),
$$E_P^*\|\mathbb{G}_n\|_{\mathcal{F}} \lesssim J_{[\,]}\big(\delta, \mathcal{F}, L_2(P)\big)\Big(1 + \frac{J_{[\,]}(\delta, \mathcal{F}, L_2(P))}{\delta^2\sqrt{n}\,\|F\|_{P,2}}\Big)\,\|F\|_{P,2}. \qquad (1.3)$$
Here the assumption that the class of functions is uniformly bounded is too restrictive for some applications, but can be removed if the entropy integral is computed relative to the stronger “norm”

$$\|f\|_{P,B} = \Big(2P\big(e^{|f|} - 1 - |f|\big)\Big)^{1/2}.$$
Although it is not a norm, this quantity can be used to define the size of brackets and hence bracketing numbers. Inequality (1.3) is valid for an arbitrary class of functions with $\|f\|_{P,B} \le \delta\,\|F\|_{P,B}$ for every f, if the $L_2(P)$-norm is replaced by $\|\cdot\|_{P,B}$ on its right side (at four appearances) (see Theorem 3.4.3 of [12]). The “norm” $\|\cdot\|_{P,B}$ derives from the refined version of Bernstein’s inequality, which was first used in the literature on rates of convergence of minimum contrast estimators in [1] (also see [11]).
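For uniformly bounded classes the Bernstein “norm” and the $L_2(P)$-norm are in fact equivalent up to constants, so either version of (1.3) applies there. Indeed, if |f| ≤ 1, then by the series expansion of the exponential

$$\|f\|_{P,B}^2 = 2\sum_{k=2}^\infty \frac{P|f|^k}{k!} \le 2\,Pf^2 \sum_{k=2}^\infty \frac{1}{k!} = 2(e-2)\,Pf^2, \qquad \text{while always}\quad \|f\|_{P,B}^2 \ge Pf^2,$$

the last inequality because $e^x - 1 - x \ge x^2/2$ for x ≥ 0.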
Maximal inequalities of type (1.3) using uniform entropy are thus far unavailable. In this note we derive an exact parallel of (1.3) for uniformly bounded functions, and investigate similar inequalities for unbounded functions. The validity of these results may seem unexpected, as the stronger control given by bracketing has often been thought necessary for estimates of moduli of continuity. Their validity was suggested to us by Theorem 3.1 and its proof in [4].
1.1. Application to minimum contrast estimators
Inequalities involving the sizes of the functions f are of particular interest in the investigation of empirical minimum contrast estimators. Suppose that $\hat\theta_n$ minimizes a criterion of the type

$$\theta \mapsto \mathbb{P}_n m_\theta,$$
for given measurable functions $x \mapsto m_\theta(x)$ indexed by a parameter θ, and that the population contrast satisfies, for a “true” parameter θ0 and some metric d on the parameter set,

$$P\big(m_\theta - m_{\theta_0}\big) \gtrsim d^2(\theta, \theta_0).$$
A bound on the rate of convergence of $\hat\theta_n$ to θ0 can then be derived from the modulus of continuity of the empirical process indexed by the functions $m_\theta$. Specifically (see e.g. [12], 3.2.5), if $\phi_n$ is a function such that $\delta \mapsto \phi_n(\delta)/\delta^\alpha$ is decreasing for some α < 2 and
$$E^*\sup_{d(\theta, \theta_0) < \delta}\big|\mathbb{G}_n\big(m_\theta - m_{\theta_0}\big)\big| \lesssim \phi_n(\delta), \qquad (1.4)$$
then $d(\hat\theta_n, \theta_0) = O_P^*(\delta_n)$, for $\delta_n$ any solution to
$$\phi_n(\delta_n) \le \sqrt{n}\,\delta_n^2. \qquad (1.5)$$
Inequality (1.4) involves the empirical process indexed by the class of functions $\mathcal{M}_\delta = \{m_\theta - m_{\theta_0}\colon d(\theta, \theta_0) < \delta\}$. If d dominates the $L_2(P)$-norm, or another norm ‖ ⋅ ‖ that can be used in an inequality of the type (1.3), such as the Bernstein norm, and the norms of the envelopes of the classes $\mathcal{M}_\delta$ are bounded in δ, then we can choose

$$\phi_n(\delta) = J(\delta)\Big(1 + \frac{J(\delta)}{\delta^2\sqrt{n}}\Big),$$
where J is an appropriate entropy integral. For this choice the inequality (1.5) is equivalent to
$$J(\delta_n) \le \sqrt{n}\,\delta_n^2. \qquad (1.6)$$
Thus a rate of convergence can be read off directly from the entropy integral.
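For instance (an illustrative specification of the entropy, not an assumption made elsewhere in this note), if $J(\delta) \lesssim \delta^{1-\alpha}$ for some α ∈ (0, 1), as results from entropies growing as ε^{−2α}, then (1.6) is solved by

$$\sqrt{n}\,\delta_n^2 \asymp \delta_n^{1-\alpha} \quad\Longleftrightarrow\quad \delta_n \asymp n^{-1/(2(1+\alpha))},$$

the rate familiar from the corresponding bracketing results (cf. [12], 3.4).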
We note that an inequality of type (1.3) is unattractive for very small δ, as the bound may even increase to infinity as δ ↓ 0. However, it is accurate for the range of δ that is important in the application to moduli of continuity.
Moduli of continuity also play an important role in model selection theorems. See for instance [7].
Inequalities involving uniform entropy permit for instance the immediate derivation of rates of convergence for minimum contrast functions that form VC-classes. Furthermore, uniform entropy is preserved under various (combinatorial) operations to make new classes of functions. This makes uniform entropy integrals a useful tool in situations where bracketing numbers may be difficult to handle. Equation (1.6) gives an elegant characterization of rates of convergence in these situations, where thus far ad-hoc arguments were necessary.
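For example, if the classes $\mathcal{M}_\delta$ are VC of index V with envelopes bounded in δ (an illustrative assumption), then, by the computation following (1.1), $J(\delta) \lesssim \delta\sqrt{V\log(1/\delta)}$, and (1.6) is satisfied by

$$\delta_n \asymp \sqrt{\frac{V\log n}{n}},$$

since then $\sqrt{n}\,\delta_n^2 \asymp V\log n/\sqrt{n} \gtrsim \delta_n\sqrt{V\log(1/\delta_n)}$. Thus the rate of the minimum contrast estimator is the parametric rate up to a logarithmic factor, without any bracketing computations.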
2. Uniformly bounded classes
Call the class $\mathcal{F}$ of functions P-measurable if the map

$$(X_1, \ldots, X_n) \mapsto \sup_{f\in\mathcal{F}}\Big|\sum_{i=1}^n e_i f(X_i)\Big|$$

on the completion of the probability space $(\mathcal{X}^n, \mathcal{A}^n, P^n)$ is measurable, for every sequence e1, e2, … , en ∈ {−1, 1}.
Theorem 2.1. Let $\mathcal{F}$ be a P-measurable class of measurable functions with envelope function F ≤ 1 and such that the class $\mathcal{F}^2 = \{f^2\colon f \in \mathcal{F}\}$ is P-measurable. If $Pf^2 < \delta^2 PF^2$ for every $f \in \mathcal{F}$ and some δ ∈ (0, 1), then

$$E_P^*\|\mathbb{G}_n\|_{\mathcal{F}} \lesssim J(\delta, \mathcal{F}, L_2)\Big(1 + \frac{J(\delta, \mathcal{F}, L_2)}{\delta^2\sqrt{n}\,\|F\|_{P,2}}\Big)\,\|F\|_{P,2}.$$
Proof. We use the following refinement of (1.1) (see e.g. [12], 2.14.1): for any P-measurable class $\mathcal{F}$,
$$E_P^*\|\mathbb{G}_n\|_{\mathcal{F}} \lesssim E^*\Big[J\big(\theta_n, \mathcal{F}, L_2\big)\,\sqrt{\mathbb{P}_n F^2}\Big], \qquad \theta_n^2 = \frac{\sup_{f\in\mathcal{F}} \mathbb{P}_n f^2}{\mathbb{P}_n F^2}. \qquad (2.1)$$
Because $t \mapsto J(t, \mathcal{F}, L_2)$ is the integral of a nonincreasing nonnegative function, it is a concave function such that the map t ↦ J(t)/t, which is the average of its derivative over [0, t], is nonincreasing. The concavity shows that its perspective $(x, t) \mapsto x\,J(t/x)$ is a concave function of its two arguments (cf. [2], page 89). Furthermore, the “extended-value extension” of this function (which by definition is −∞ if x ≤ 0 or t ≤ 0) is obviously nondecreasing in its second argument, as J is nondecreasing, and is nondecreasing in its first argument, because t ↦ J(t)/t was noted to be nonincreasing. Therefore, by the vector composition rules for concave functions ([2], pages 83–87, especially lines -2 and -1 of page 86), the function

$$H(x, t) = \sqrt{x}\;J\Big(\sqrt{t/x}\Big)$$

is concave. We have that $J(\theta_n, \mathcal{F}, L_2)\,\sqrt{\mathbb{P}_n F^2} = H\big(\mathbb{P}_n F^2,\, \sup_{f\in\mathcal{F}} \mathbb{P}_n f^2\big)$. Therefore, by an application of Jensen’s inequality to the right side of the preceding display we obtain, for $\sigma_n^2 := E^*\sup_{f\in\mathcal{F}} \mathbb{P}_n f^2$,
$$E_P^*\|\mathbb{G}_n\|_{\mathcal{F}} \lesssim J\Big(\frac{\sigma_n}{\|F\|_{P,2}}, \mathcal{F}, L_2\Big)\,\|F\|_{P,2}. \qquad (2.2)$$
The application of Jensen’s inequality with outer expectations can be justified here by the monotonicity of the function H, which shows that the measurable majorant of a variable H(U, V) is bounded above by H(U*, V*), for U* and V* measurable majorants of U and V. Thus E*H(U, V) ≤ EH(U*, V*), after which Jensen’s inequality can be applied in its usual (measurable) form.
The second step of the proof is to bound $\sigma_n^2 = E^*\sup_{f\in\mathcal{F}} \mathbb{P}_n f^2$. Because $\mathbb{P}_n f^2 = Pf^2 + n^{-1/2}\,\mathbb{G}_n f^2$ and Pf² ≤ δ²PF² for every f, we have
$$E^*\sup_{f\in\mathcal{F}} \mathbb{P}_n f^2 \le \delta^2\,PF^2 + \frac{1}{\sqrt{n}}\,E^*\|\mathbb{G}_n\|_{\mathcal{F}^2}. \qquad (2.3)$$
Here the empirical process in the second term can be replaced by the symmetrized empirical process (defined as $\mathbb{G}_n^\circ f = n^{-1/2}\sum_{i=1}^n \varepsilon_i f(X_i)$ for independent Rademacher variables ε1, ε2, … , εn) at the cost of adding a multiplicative factor 2 (e.g. [12], 2.3.1). The expectation can be factorized as the expectation on the Rademacher variables ε followed by the expectation on X1, …, Xn. By the contraction principle for Rademacher variables ([6], Theorem 4.12), and the fact that F ≤ 1 by assumption, so that t ↦ t² is Lipschitz with constant 2 on the relevant range, the inner expectation satisfies $E_\varepsilon\|\mathbb{G}_n^\circ\|_{\mathcal{F}^2} \lesssim E_\varepsilon\|\mathbb{G}_n^\circ\|_{\mathcal{F}}$. Taking the expectation on X1, … , Xn, we obtain that $E^*\|\mathbb{G}_n^\circ\|_{\mathcal{F}^2} \lesssim E^*\|\mathbb{G}_n^\circ\|_{\mathcal{F}}$, which in turn is bounded above by a multiple of $E^*\|\mathbb{G}_n\|_{\mathcal{F}}$ by the desymmetrization inequality (e.g. 2.3.6 in [12]).
Thus $E^*\|\mathbb{G}_n\|_{\mathcal{F}^2}$ in the last term of (2.3) can be replaced by $E^*\|\mathbb{G}_n\|_{\mathcal{F}}$, at the cost of inserting a constant. Next we apply (2.2) to this term, and conclude that $z_n := \sigma_n/\|F\|_{P,2}$ satisfies the inequality
$$z_n^2 \le \delta^2 + \frac{C}{\sqrt{n}\,\|F\|_{P,2}}\,J(z_n, \mathcal{F}, L_2), \qquad (2.4)$$

for a universal constant C.
We apply Lemma 2.1 with r = 1, A = δ and $B^2 = C/(\sqrt{n}\,\|F\|_{P,2})$ to see that

$$J(z_n, \mathcal{F}, L_2) \lesssim J(\delta, \mathcal{F}, L_2)\Big(1 + \frac{J(\delta, \mathcal{F}, L_2)}{\delta^2\sqrt{n}\,\|F\|_{P,2}}\Big).$$
We insert this in (2.2) to complete the proof.
Lemma 2.1. Let $J\colon [0, \infty) \to [0, \infty)$ be a concave, nondecreasing function with J(0) = 0. If $z^2 \le A^2 + B^2 J(z^r)$ for some r ∈ (0, 2) and A, B > 0, then

$$J(z^r) \lesssim J(A^r)\Big[1 + J(A^r)^{r/(2-r)}\Big(\frac{B}{A}\Big)^{2r/(2-r)}\Big], \qquad z^2 \lesssim A^2 + \Big(\frac{B^2\,J(A^r)}{A^r}\Big)^{2/(2-r)}.$$
Proof. For t > s > 0 we can write s as the convex combination s = (s/t)t+ (1−s/t)0 of t and 0. Since J(0) = 0, the concavity of J gives that J(s) ≥ (s/t)J(t). Thus the function t ↦ J(t)/t is decreasing, which implies that J(Ct) ≤ CJ(t) for C ≥ 1 and any t > 0.
By the monotonicity of J and the assumption on z it follows that

$$J(z^r) \le J\big((A^2 + B^2 J(z^r))^{r/2}\big) \le J\Big(A^r\Big(1 + \frac{B^2}{A^2}J(z^r)\Big)^{r/2}\Big) \lesssim J(A^r)\Big(1 + \Big(\frac{B}{A}\Big)^r J(z^r)^{r/2}\Big).$$
This implies that J(z^r) is bounded by a multiple of the maximum of $J(A^r)$ and $J(A^r)(B/A)^r J(z^r)^{r/2}$. If it is bounded by the second one, then $J(z^r)^{1-r/2} \lesssim J(A^r)(B/A)^r$, so that $J(z^r) \lesssim J(A^r)^{2/(2-r)}(B/A)^{2r/(2-r)}$. We conclude that

$$J(z^r) \lesssim J(A^r) + J(A^r)^{2/(2-r)}\Big(\frac{B}{A}\Big)^{2r/(2-r)},$$

which is the first assertion.
Next again by the monotonicity of J,

$$z^2 \le A^2 + B^2 J(z^r) \lesssim A^2 + B^2 J(A^r) + B^2 J(A^r)^{2/(2-r)}\Big(\frac{B}{A}\Big)^{2r/(2-r)}.$$
The middle term on the right side is bounded by a multiple of the sum of the first and third terms, since $xy \le x^p/p + y^q/q$ for any conjugate pair (p, q) and any x, y > 0; in particular, with p = 2/r and q = 2/(2 − r),

$$B^2 J(A^r) \le \frac{r}{2}\,A^2 + \frac{2-r}{2}\Big(\frac{B^2 J(A^r)}{A^r}\Big)^{2/(2-r)}.$$
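For the case r = 1, which is the case used in the proof of Theorem 2.1, the conclusion of the lemma specializes to

$$J(z) \lesssim J(A)\Big(1 + J(A)\,\frac{B^2}{A^2}\Big),$$

which with A = δ and $B^2 = C/(\sqrt{n}\,\|F\|_{P,2})$ gives exactly the factor appearing in the statement of Theorem 2.1.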
For values of δ such that $\delta^2\sqrt{n}\,\|F\|_{P,2} \le J(\delta, \mathcal{F}, L_2)$ Theorem 2.1 can be improved. (This seems not to be of prime interest for statistical applications.) Its bound can be written in the form $J(\delta)\,\|F\|_{P,2} + J^2(\delta)/(\delta^2\sqrt{n})$. In the second term δ can be replaced by the solution $\bar\delta_n$ of the equation $\bar\delta_n^2\sqrt{n}\,\|F\|_{P,2} = J(\bar\delta_n)$, which is better if δ is smaller than the latter number, as the function $\delta \mapsto J(\delta)/\delta$ is decreasing.
Lemma 2.2. Under the conditions of Theorem 2.1, with $\bar\delta_n$ the solution of $\bar\delta_n^2\sqrt{n}\,\|F\|_{P,2} = J(\bar\delta_n, \mathcal{F}, L_2)$,

$$E_P^*\|\mathbb{G}_n\|_{\mathcal{F}} \lesssim J(\delta, \mathcal{F}, L_2)\,\|F\|_{P,2} + \frac{J^2\big(\delta \vee \bar\delta_n, \mathcal{F}, L_2\big)}{(\delta \vee \bar\delta_n)^2\,\sqrt{n}}.$$
Proof. We follow the proof of Theorem 2.1 up to (2.4), but next use the alternative bounds

$$J(z_n) \le J\big(\delta + B\sqrt{J(z_n)}\big) \le J(\delta) + J(\bar\delta_n) + \frac{B\sqrt{J(z_n)}}{\bar\delta_n}\,J(\bar\delta_n),$$

for $B^2 = C/(\sqrt{n}\,\|F\|_{P,2})$ as in that proof. Here we have used the subadditivity of the maps $z \mapsto \sqrt{z}$ and J, and the inequality J(Cz) ≤ CJ(z) for C ≥ 1 in the last step. We can bound the sum of the three terms on the right side by a multiple of the maximum of these terms and conclude that the left side is smaller than a multiple of at least one of the three terms. Solving next yields that

$$J(z_n) \lesssim J(\delta) + J(\bar\delta_n) + \frac{B^2 J^2(\bar\delta_n)}{\bar\delta_n^2}.$$
Because $J(\delta_n) \ge \delta_n$ for every δn > 0, by the definition of the entropy integral (the integrand being at least 1), and by the definition of $\bar\delta_n$, the third term on the right is bounded by a multiple of the second term. We substitute the bound in (2.2) to finish the proof.
3. Unbounded classes
In this section we investigate relaxations of the assumption that the class of functions is uniformly bounded, made in Theorem 2.1. We start with a moment bound on the envelope.
Theorem 3.1. Let $\mathcal{F}$ be a P-measurable class of measurable functions with envelope function F such that $PF^{(4p-2)/(p-1)} < \infty$ for some p > 1 and such that the classes $\mathcal{F}^2$ and $\mathcal{F}^4$ are P-measurable. If Pf² < δ²PF² for every f and some δ ∈ (0, 1), then, for a constant $c_p(F)$ depending on p and the moments of F only,

$$E_P^*\|\mathbb{G}_n\|_{\mathcal{F}} \le c_p(F)\,J\big(\delta^{1/p}, \mathcal{F}, L_2\big)\bigg(1 + \Big(\frac{J(\delta^{1/p}, \mathcal{F}, L_2)}{\delta^2\sqrt{n}\,\|F\|_{P,2}}\Big)^{1/(2p-1)}\bigg)\,\|F\|_{P,2}.$$
Proof. Application of (2.1) to the functions f², forming the class $\mathcal{F}^2$ with envelope function F², yields
$$E_P^*\|\mathbb{G}_n\|_{\mathcal{F}^2} \lesssim E^*\bigg[J\Big(\frac{\sigma_{n,4}^2}{\|F\|_{\mathbb{P}_n,4}^2}, \mathcal{F}^2, L_2\Big)\,\sqrt{\mathbb{P}_n F^4}\bigg], \qquad (3.1)$$
for $\sigma_{n,r}$ the diameter of $\mathcal{F}$ in $L_r(\mathbb{P}_n)$, i.e.
$$\sigma_{n,r} = \sup_{f\in\mathcal{F}}\big(\mathbb{P}_n|f|^r\big)^{1/r}. \qquad (3.2)$$
Preservation properties of uniform entropy (see [10], or [12], 2.10.20, where the supremum over Q can also be moved outside the integral to match our current definition of entropy integral, applied to ϕ(f) = f² with L = 2F) show that $J(\delta, \mathcal{F}^2, L_2) \lesssim J(\delta, \mathcal{F}, L_2)$, for every δ > 0. Because $\mathbb{P}_n f^2 = Pf^2 + n^{-1/2}\,\mathbb{G}_n f^2$ and Pf² ≤ δ²PF² by assumption, we find that
$$E^*\sigma_{n,2}^2 \le \delta^2\,PF^2 + \frac{1}{\sqrt{n}}\,E^*\bigg[J\Big(\frac{\sigma_{n,4}^2}{\|F\|_{\mathbb{P}_n,4}^2}, \mathcal{F}, L_2\Big)\,\sqrt{\mathbb{P}_n F^4}\bigg]. \qquad (3.3)$$
The next step is to bound $\sigma_{n,4}$ in terms of $\sigma_{n,2}$.
By Hölder’s inequality, for any conjugate pair (p, q) and any 0 < s < 4,

$$\mathbb{P}_n|f|^4 = \mathbb{P}_n|f|^{4-s}|f|^{s} \le \big(\mathbb{P}_n|f|^{(4-s)p}\big)^{1/p}\,\big(\mathbb{P}_n|f|^{sq}\big)^{1/q}.$$
Choosing s such that (4 − s)p = 2, so that s = 4 − 2/p and hence sq = (4p − 2)/(p − 1) for q = p/(p − 1), and using that |f| ≤ F, we find that

$$\sigma_{n,4}^4 \le \sigma_{n,2}^{2/p}\,\big(\mathbb{P}_n F^{(4p-2)/(p-1)}\big)^{1/q}.$$
We insert this bound in (3.3). The function $(x, y) \mapsto x^{1/p}y^{1/q}$ is concave, and hence the function $(x, y, t) \mapsto J\big(x^{1/(2p)}y^{1/(2q)}/t^{1/2}\big)\,t^{1/2}$ can be seen to be concave by the same arguments as in the proof of Theorem 2.1. Therefore, we can apply Jensen’s inequality to see that

$$E^*\bigg[J\Big(\frac{\sigma_{n,4}^2}{\|F\|_{\mathbb{P}_n,4}^2}, \mathcal{F}, L_2\Big)\,\sqrt{\mathbb{P}_n F^4}\bigg] \le J\bigg(\frac{(E^*\sigma_{n,2}^2)^{1/(2p)}\,\big(PF^{(4p-2)/(p-1)}\big)^{1/(2q)}}{(PF^4)^{1/2}}, \mathcal{F}, L_2\bigg)\,\sqrt{PF^4}.$$
We conclude that $z_n = \sqrt{E^*\sigma_{n,2}^2}/\|F\|_{P,2}$ satisfies

$$z_n^2 \le \delta^2 + \frac{c_p(F)}{\sqrt{n}\,\|F\|_{P,2}}\,J\big(z_n^{1/p}, \mathcal{F}, L_2\big),$$

for a constant $c_p(F)$ depending on p and the moments of F.
In the last step we use that J(Ct) ≤ CJ(t) for C ≥ 1, and Hölder’s inequality as previously to see that the present C satisfies this condition. We next apply Lemma 2.1 (with r = 1/p) to obtain a bound on $J(z_n^{1/p}, \mathcal{F}, L_2)$, and conclude the proof by substituting this bound in (2.2), where we note that $z_n \le 1$, so that $J(z_n) \le J(z_n^{1/p})$.
The preceding theorem assumes only a finite moment of the envelope function, but in comparison to Theorem 2.1 substitutes $\delta^{1/p}$ for δ in the correction term of the upper bound, where p > 1 and hence δ^{1/p} ≫ δ for small δ. In applications to moduli of continuity of minimum contrast criteria this is sufficient to obtain consistency with a rate, but typically the rate will be suboptimal. The rate improves as p ↓ 1, which requires finite moments of the envelope function of order increasing to infinity, the limiting case p = 1 corresponding to a bounded envelope, as in Theorem 2.1. The following theorem interpolates between finite moments of any order and a bounded envelope function. If applied to obtaining rates of convergence it gives rates that are optimal up to a logarithmic factor.
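As an indication of the trade-off between moments and rate (illustrative arithmetic only): for p = 2 the moment condition requires $PF^6 < \infty$ and the correction is evaluated at δ^{1/2}, while evaluation at δ^{3/4} already requires p = 4/3 and hence $PF^{10} < \infty$, since

$$\frac{4p-2}{p-1}\bigg|_{p=2} = 6, \qquad \frac{4p-2}{p-1}\bigg|_{p=4/3} = 10.$$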
Theorem 3.2. Let $\mathcal{F}$ be a P-measurable class of measurable functions with envelope function F such that $P\exp(F^{p+\rho}) < \infty$ for some p, ρ > 0 and such that the classes $\mathcal{F}^2$ and $\mathcal{F}^4$ are P-measurable. If Pf² < δ²PF² for every f and some δ ∈ (0, 1/2), then for a constant c dependent on p, PF², PF⁴ and $P\exp(F^{p+\rho})$,

$$E_P^*\|\mathbb{G}_n\|_{\mathcal{F}} \le c\,J\big(\delta\log^{1/p}(1/\delta), \mathcal{F}, L_2\big)\bigg(1 + \frac{J\big(\delta\log^{1/p}(1/\delta), \mathcal{F}, L_2\big)}{\delta^2\sqrt{n}\,\|F\|_{P,2}}\bigg)\,\|F\|_{P,2}.$$
Proof. Fix r = 2/p. The functions ψ and ϕ, defined by

$$\psi(x) = \log^r(1+x), \qquad \phi(y) = e^{y^{1/r}} - 1,$$

are each other’s inverses, and are increasing from 0 to infinity. Thus their primitive functions $\Psi(f) = \int_0^f \psi(x)\,dx$ and $\Phi(g) = \int_0^g \phi(y)\,dy$ satisfy Young’s inequality $fg \le \Psi(f) + \Phi(g)$, for every f, g ≥ 0 (e.g. [2], page 120, 3.38).
The function $t \mapsto t\log^r(1/t)$ is concave in a neighbourhood of 0 (specifically: on the interval (0, e^{1−r} ∧ 1); see the computation following (3.4)), with limit from the right equal to 0 at 0, and derivative tending to infinity at this point. Therefore, there exists a concave, increasing function k: (0, ∞) → (0, ∞) that is identical to $t \mapsto t\log^r(1/t)$ near 0 and bounded below and above by a positive constant times the identity throughout its domain. (E.g. extend $t \mapsto t\log^r(1/t)$ linearly with slope 1 from the point where the derivative of the latter function has decreased to 1.) Write $k(t) = t\,l^r(t)$, so that $l^r$ is bounded below by a constant and l(t) = log(1/t) near 0. Then, for every t > 0,
(3.4) |
(The constant in ≲ may depend on r.) To see this, note that for C > c the left side is bounded by a multiple of log(2 + t/c), whereas for small C the left side is bounded by a multiple of .
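The interval of concavity mentioned above can be verified directly: with $h(t) = t\log^r(1/t)$,

$$h''(t) = -\frac{r}{t}\,\Big(\log\frac{1}{t}\Big)^{r-2}\Big(\log\frac{1}{t} - (r-1)\Big),$$

which is nonpositive exactly when $t \le e^{1-r}$ (and automatically for all t ∈ (0, 1) when r ≤ 1).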
From the inequality Ψ(f) ≤ fψ(f), we obtain that, for f > 0,
Therefore, by (3.4) followed by Young’s inequality,
On integrating this with respect to the empirical measure, with $G = \Phi(F^2)$, we see that $\mathbb{P}_n f^4$ is bounded by a multiple of $k(\mathbb{P}_n f^2)\,(1 + \mathbb{P}_n G)$.
We take the supremum over f to bound $\sigma_{n,4}^4$ as in (3.2) in terms of $k(\sigma_{n,2}^2)$, and next substitute this bound in (3.3) to find that
where we have used the concavity of k, and the concavity of the other maps, as previously. By assumption the expected value PG is finite for r = 2/p. It follows that $z_n = \sqrt{E^*\sigma_{n,2}^2}/\|F\|_{P,2}$ satisfies, for suitable constants a, b, c depending on r, PF², PF⁴ and PG,

$$z_n^2 \le a\,\delta^2 + \frac{b}{\sqrt{n}\,\|F\|_{P,2}}\,J\big(c\,\sqrt{k(z_n^2)},\, \mathcal{F}, L_2\big).$$
By concavity and the fact that k(0) = 0, we have k(Cz) ≤ Ck(z), for C ≥ 1 and z > 0. The function $z \mapsto \sqrt{k(z^2)}$ inherits this property. Therefore we can apply Lemma 3.1, with k of the lemma equal to the present function $z \mapsto \sqrt{k(z^2)}$, to obtain a bound on $J(\sqrt{k(z_n^2)}, \mathcal{F}, L_2)$ in terms of $J(\sqrt{k(\delta^2)}, \mathcal{F}, L_2)$, which we substitute in (2.2). Here k(δ²) = δ² log^r(1/δ) for sufficiently small δ > 0, and k(δ²) ≍ δ² for δ < 1/2 and bounded away from 0. Thus we can simplify the bound to the one in the statement of the theorem, possibly after increasing the constants a, b, c to be at least 1, to complete the proof.
Lemma 3.1. Let $J\colon [0, \infty) \to [0, \infty)$ be a concave, nondecreasing function with J(0) = 0, and let k: (0, ∞) → (0, ∞) be nondecreasing and satisfy k(Cz) ≤ Ck(z) for C ≥ 1 and z > 0. If $z^2 \le A^2 + B^2 J(k(z))$ for some A, B > 0, then

$$J(k(z)) \lesssim J(k(A))\Big(1 + J(k(A))\,\frac{B^2}{A^2}\Big), \qquad z^2 \lesssim A^2 + \frac{B^4}{A^2}\,J(k(A))^2.$$
Proof. As noted in the proof of Lemma 2.1 the properties of J imply that J(Cz) ≤ CJ(z) for C ≥ 1 and any z > 0. In view of the assumed property of k and the monotonicity of J it follows that $J(k(Cz)) \le C\,J(k(z))$ for every C ≥ 1 and z > 0. Therefore, by the monotonicity of J and k, and the assumption on z,

$$J(k(z)) \le J\Big(k\big(A + B\sqrt{J(k(z))}\big)\Big) \le \Big(1 + \frac{B}{A}\sqrt{J(k(z))}\Big)\,J(k(A)).$$
As in the proof of Lemma 2.1 we can solve this for J ∘ k(z) to find that

$$J(k(z)) \lesssim J(k(A))\Big(1 + J(k(A))\,\frac{B^2}{A^2}\Big).$$
Next again by the monotonicity of J,

$$z^2 \le A^2 + B^2 J(k(z)) \lesssim A^2 + B^2 J(k(A)) + \frac{B^4}{A^2}\,J(k(A))^2.$$
The middle term on the right side is bounded by the sum of the first and third terms, since $2xy \le x^2 + y^2$ with x = A and $y = B^2 J(k(A))/A$.
Footnotes
AMS 2000 subject classifications: 60K35.
Contributor Information
Aad van der Vaart, Department of Mathematics, Faculty of Sciences, Vrije Universiteit, De Boelelaan 1081a, 1081 HV Amsterdam, aad@cs.vu.nl.
Jon A. Wellner, Department of Statistics, University of Washington, Seattle, WA 98195-4322, jaw@stat.washington.edu.
References
- [1].Birgé L, Massart P. Rates of convergence for minimum contrast estimators. Probab. Theory Related Fields. 1993;97(1-2):113–150. MR1240719.
- [2].Boyd S, Vandenberghe L. Convex Optimization. Cambridge University Press; Cambridge: 2004. MR2061575.
- [3].Dudley RM. Central limit theorems for empirical measures. Ann. Probab. 1978;6(6):899–929. MR0512411.
- [4].Giné E, Koltchinskii V. Concentration inequalities and asymptotic results for ratio type empirical processes. Ann. Probab. 2006;34(3):1143–1216. MR2243881.
- [5].Kolchins’kiĭ VĪ. On the central limit theorem for empirical measures. Teor. Veroyatnost. i Mat. Statist. 1981;24:63–75, 152. MR0628431.
- [6].Ledoux M, Talagrand M. Probability in Banach Spaces: Isoperimetry and Processes. Vol. 23 of Ergebnisse der Mathematik und ihrer Grenzgebiete (3). Springer-Verlag; Berlin: 1991. MR1102015.
- [7].Massart P, Nédélec É. Risk bounds for statistical learning. Ann. Statist. 2006;34(5):2326–2366. MR2291502.
- [8].Ossiander M. A central limit theorem under metric entropy with L2 bracketing. Ann. Probab. 1987;15(3):897–919. MR0893905.
- [9].Pollard D. A central limit theorem for empirical processes. J. Austral. Math. Soc. Ser. A. 1982;33(2):235–248. MR0668445.
- [10].Pollard D. Empirical Processes: Theory and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics, 2. Institute of Mathematical Statistics; Hayward, CA: 1990. MR1089429.
- [11].van de Geer S. The method of sieves and minimum contrast estimators. Math. Methods Statist. 1995;4(1):20–38. MR1324688.
- [12].van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. With Applications to Statistics. Springer Series in Statistics. Springer-Verlag; New York: 1996. MR1385671.