Abstract
We derive an upper bound for the mean of the supremum of the empirical process indexed by a class of functions that are known to have variance bounded by a small constant δ². The bound is expressed in terms of the uniform entropy integral of the class at δ. The bound yields a rate of convergence for minimum contrast estimators when applied to the modulus of continuity of the contrast functions.
Keywords: Empirical process, modulus of continuity, minimum contrast estimator, rate of convergence
1. Introduction
The empirical measure and empirical process of a sample of observations X1, … , Xn from a probability measure P on a measurable space $(\mathcal{X}, \mathcal{A})$ attach to a given measurable function $f\colon \mathcal{X} \to \mathbb{R}$ the numbers

$$\mathbb{P}_n f = \frac{1}{n}\sum_{i=1}^n f(X_i), \qquad \mathbb{G}_n f = \sqrt{n}\,\big(\mathbb{P}_n f - Pf\big).$$
It is often useful to study the suprema of these stochastic processes over a given class $\mathcal{F}$ of measurable functions. The distribution of the supremum

$$\|\mathbb{G}_n\|_{\mathcal{F}} = \sup_{f\in\mathcal{F}}\big|\mathbb{G}_n f\big|$$
is known to concentrate near its mean value, at a rate depending on the size of the envelope function of the class $\mathcal{F}$, but irrespective of its complexity. On the other hand, the mean value of $\|\mathbb{G}_n\|_{\mathcal{F}}$ depends on the size of the class $\mathcal{F}$. Entropy integrals, of which there are two basic versions, are useful tools to bound this mean value.
The uniform entropy integral was introduced in [9] and [5], following [3], in their study of the abstract version of Donsker’s theorem. We define an $L_r$-version of it as

$$J(\delta, \mathcal{F}, L_r) = \sup_Q \int_0^\delta \sqrt{1 + \log N\big(\varepsilon\|F\|_{Q,r},\, \mathcal{F},\, L_r(Q)\big)}\;d\varepsilon.$$
Here the supremum is taken over all finitely discrete probability distributions Q on $(\mathcal{X}, \mathcal{A})$, the covering number $N(\varepsilon, \mathcal{F}, L_r(Q))$ is the minimal number of balls of radius ε in $L_r(Q)$ needed to cover $\mathcal{F}$, F is an envelope function of $\mathcal{F}$, and $\|f\|_{Q,r}$ denotes the norm of a function f in $L_r(Q)$. The integral is defined relative to an envelope function, which need not be the minimal one, but can be any measurable function such that |f| ≤ F for every $f \in \mathcal{F}$. If multiple envelope functions are under consideration, then we write the envelope explicitly, as in $J(\delta, \mathcal{F}, F, L_r)$, to stress this dependence. An inequality, due to Pollard (also see [12], 2.14.1), says, under some measurability assumptions, that
$$E_P^*\|\mathbb{G}_n\|_{\mathcal{F}} \lesssim J(1, \mathcal{F}, L_2)\,\|F\|_{P,2}. \qquad (1.1)$$
Here ≲ means smaller than up to a universal constant. This shows that for a class $\mathcal{F}$ with finite uniform entropy integral, the supremum $\|\mathbb{G}_n\|_{\mathcal{F}}$ is not essentially bigger than a multiple of the empirical process at the envelope function F. The inequality is particularly useful if this envelope function is small.
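For orientation, consider a VC-class $\mathcal{F}$ of index V (an illustrative special case, not needed in what follows). Its covering numbers satisfy a polynomial bound, $\sup_Q \log N(\varepsilon\|F\|_{Q,2}, \mathcal{F}, L_2(Q)) \lesssim V\log(1/\varepsilon)$ ([12], Theorem 2.6.7), so that

$$J(\delta, \mathcal{F}, L_2) \lesssim \int_0^\delta \sqrt{1 + V\log(1/\varepsilon)}\;d\varepsilon \lesssim \delta\,\sqrt{V\log(1/\delta)}, \qquad \delta \le 1/2.$$

In particular $J(1, \mathcal{F}, L_2)$ is finite, and (1.1) applies with a dimension-type constant of the order $\sqrt{V}$.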
The bracketing entropy integral has its roots in the Donsker theorem of [8], again following initial work by Dudley. For a given norm ‖ ⋅ ‖ it can be defined as

$$J_{[\,]}\big(\delta, \mathcal{F}, \|\cdot\|\big) = \int_0^\delta \sqrt{1 + \log N_{[\,]}\big(\varepsilon\|F\|,\, \mathcal{F},\, \|\cdot\|\big)}\;d\varepsilon.$$
Here the bracketing number $N_{[\,]}(\varepsilon, \mathcal{F}, \|\cdot\|)$ is the minimal number of brackets of size smaller than ε needed to cover $\mathcal{F}$. A useful inequality, due to Pollard (also see [12], 2.14.2), is
$$E_P^*\|\mathbb{G}_n\|_{\mathcal{F}} \lesssim J_{[\,]}\big(1, \mathcal{F}, L_2(P)\big)\,\|F\|_{P,2}. \qquad (1.2)$$
Bracketing numbers are bigger than covering numbers (at twice the size), and hence the bracketing integral is bigger than a multiple of the corresponding entropy integral. However, the bracketing integral involves only the single distribution P, whereas the uniform entropy integral takes a supremum over all (discrete) distributions, making the two integrals incomparable in general. Apart from this difference the two maximal inequalities have the same message.
The two inequalities (1.1) and (1.2) involve the size of the envelope function, but not the sizes of the individual functions in the class $\mathcal{F}$. They also exploit only the finiteness of the entropy integrals, roughly requiring that the entropy grows at smaller order than ε^{−2} as ε ↓ 0, and not the precise size of the entropy. In the case of the bracketing integral this is remedied by the inequality (see [12], 3.4.2), valid for any class of functions with envelope F ≤ 1 such that Pf² ≤ δ²PF² for every f, and any δ ∈ (0, 1),
$$E_P^*\|\mathbb{G}_n\|_{\mathcal{F}} \lesssim J_{[\,]}\big(\delta, \mathcal{F}, L_2(P)\big)\Big(1 + \frac{J_{[\,]}(\delta, \mathcal{F}, L_2(P))}{\delta^2\sqrt{n}\,\|F\|_{P,2}}\Big)\,\|F\|_{P,2}. \qquad (1.3)$$
Here the assumption that the class of functions is uniformly bounded is too restrictive for some applications, but can be removed if the entropy integral is computed relative to the stronger “norm”

$$\|f\|_{P,B} = \Big(2P\big(e^{|f|} - 1 - |f|\big)\Big)^{1/2}.$$
Although it is not a norm, this quantity can be used to define the size of brackets and hence bracketing numbers. Inequality (1.3) is valid for an arbitrary class of functions with $\|f\|_{P,B} \le \delta\,\|F\|_{P,B}$ for every f, if the $L_2(P)$-norm is replaced by $\|\cdot\|_{P,B}$ on its right side (at four appearances) (see Theorem 3.4.3 of [12]). The “norm” $\|\cdot\|_{P,B}$ derives from the refined version of Bernstein’s inequality, which was first used in the literature on rates of convergence of minimum contrast estimators in [1] (also see [11]).
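For uniformly bounded classes the Bernstein “norm” and the $L_2(P)$-norm are in fact equivalent up to constants, so either version of (1.3) applies there. Indeed, if |f| ≤ 1, then by the series expansion of the exponential

$$\|f\|_{P,B}^2 = 2\sum_{k=2}^\infty \frac{P|f|^k}{k!} \le 2\,Pf^2 \sum_{k=2}^\infty \frac{1}{k!} = 2(e-2)\,Pf^2, \qquad \text{while always}\quad \|f\|_{P,B}^2 \ge Pf^2,$$

the last inequality because $e^x - 1 - x \ge x^2/2$ for x ≥ 0.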
Maximal inequalities of type (1.3) using uniform entropy are thus far unavailable. In this note we derive an exact parallel of (1.3) for uniformly bounded functions, and investigate similar inequalities for unbounded functions. The validity of these results may seem unexpected, as the stronger control given by bracketing has often been thought necessary for estimates of moduli of continuity. Their validity was suggested to us by Theorem 3.1 and its proof in [4].
1.1. Application to minimum contrast estimators
Inequalities involving the sizes of the functions f are of particular interest in the investigation of empirical minimum contrast estimators. Suppose that $\hat\theta_n$ minimizes a criterion of the type

$$\theta \mapsto \mathbb{P}_n m_\theta,$$
for given measurable functions $x \mapsto m_\theta(x)$ indexed by a parameter θ, and that the population contrast satisfies, for a “true” parameter θ0 and some metric d on the parameter set,

$$P\big(m_\theta - m_{\theta_0}\big) \gtrsim d^2(\theta, \theta_0).$$
A bound on the rate of convergence of $\hat\theta_n$ to θ0 can then be derived from the modulus of continuity of the empirical process indexed by the functions $m_\theta$. Specifically (see e.g. [12], 3.2.5), if $\phi_n$ is a function such that $\delta \mapsto \phi_n(\delta)/\delta^\alpha$ is decreasing for some α < 2 and
$$E^*\sup_{d(\theta, \theta_0) < \delta}\big|\mathbb{G}_n\big(m_\theta - m_{\theta_0}\big)\big| \lesssim \phi_n(\delta), \qquad (1.4)$$
then $d(\hat\theta_n, \theta_0) = O_P^*(\delta_n)$, for $\delta_n$ any solution to
$$\phi_n(\delta_n) \le \sqrt{n}\,\delta_n^2. \qquad (1.5)$$
Inequality (1.4) involves the empirical process indexed by the class of functions $\mathcal{M}_\delta = \{m_\theta - m_{\theta_0}\colon d(\theta, \theta_0) < \delta\}$. If d dominates the $L_2(P)$-norm, or another norm ‖ ⋅ ‖ that can be used in an inequality of the type (1.3), such as the Bernstein norm, and the norms of the envelopes of the classes $\mathcal{M}_\delta$ are bounded in δ, then we can choose

$$\phi_n(\delta) = J(\delta)\Big(1 + \frac{J(\delta)}{\delta^2\sqrt{n}}\Big),$$
where J is an appropriate entropy integral. For this choice the inequality (1.5) is equivalent to
$$J(\delta_n) \le \sqrt{n}\,\delta_n^2. \qquad (1.6)$$
Thus a rate of convergence can be read off directly from the entropy integral.
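For instance (an illustrative specification of the entropy, not an assumption made elsewhere in this note), if $J(\delta) \lesssim \delta^{1-\alpha}$ for some α ∈ (0, 1), as results from entropies growing as ε^{−2α}, then (1.6) is solved by

$$\sqrt{n}\,\delta_n^2 \asymp \delta_n^{1-\alpha} \quad\Longleftrightarrow\quad \delta_n \asymp n^{-1/(2(1+\alpha))},$$

the rate familiar from the corresponding bracketing results (cf. [12], 3.4).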
We note that an inequality of type (1.3) is unattractive for very small δ, as the bound may even increase to infinity as δ ↓ 0. However, it is accurate for the range of δ that is important in the application to moduli of continuity.
Moduli of continuity also play an important role in model selection theorems. See for instance [7].
Inequalities involving uniform entropy permit for instance the immediate derivation of rates of convergence for minimum contrast functions that form VC-classes. Furthermore, uniform entropy is preserved under various (combinatorial) operations to make new classes of functions. This makes uniform entropy integrals a useful tool in situations where bracketing numbers may be difficult to handle. Equation (1.6) gives an elegant characterization of rates of convergence in these situations, where thus far ad-hoc arguments were necessary.
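For example, if the classes $\mathcal{M}_\delta$ are VC of index V with envelopes bounded in δ (an illustrative assumption), then, by the computation following (1.1), $J(\delta) \lesssim \delta\sqrt{V\log(1/\delta)}$, and (1.6) is satisfied by

$$\delta_n \asymp \sqrt{\frac{V\log n}{n}},$$

since then $\sqrt{n}\,\delta_n^2 \asymp V\log n/\sqrt{n} \gtrsim \delta_n\sqrt{V\log(1/\delta_n)}$. Thus the rate of the minimum contrast estimator is the parametric rate up to a logarithmic factor, without any bracketing computations.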
2. Uniformly bounded classes
Call the class $\mathcal{F}$ of functions P-measurable if the map

$$(X_1, \ldots, X_n) \mapsto \sup_{f\in\mathcal{F}}\Big|\sum_{i=1}^n e_i f(X_i)\Big|$$

on the completion of the probability space $(\mathcal{X}^n, \mathcal{A}^n, P^n)$ is measurable, for every sequence e1, e2, … , en ∈ {−1, 1}.
Theorem 2.1. Let $\mathcal{F}$ be a P-measurable class of measurable functions with envelope function F ≤ 1 and such that the class $\mathcal{F}^2 = \{f^2\colon f \in \mathcal{F}\}$ is P-measurable. If $Pf^2 < \delta^2 PF^2$ for every $f \in \mathcal{F}$ and some δ ∈ (0, 1), then

$$E_P^*\|\mathbb{G}_n\|_{\mathcal{F}} \lesssim J(\delta, \mathcal{F}, L_2)\Big(1 + \frac{J(\delta, \mathcal{F}, L_2)}{\delta^2\sqrt{n}\,\|F\|_{P,2}}\Big)\,\|F\|_{P,2}.$$
Proof. We use the following refinement of (1.1) (see e.g. [12], 2.14.1): for any P-measurable class $\mathcal{F}$,
$$E_P^*\|\mathbb{G}_n\|_{\mathcal{F}} \lesssim E^*\Big[J\big(\theta_n, \mathcal{F}, L_2\big)\,\sqrt{\mathbb{P}_n F^2}\Big], \qquad \theta_n^2 = \frac{\sup_{f\in\mathcal{F}} \mathbb{P}_n f^2}{\mathbb{P}_n F^2}. \qquad (2.1)$$
Because $t \mapsto J(t, \mathcal{F}, L_2)$ is the integral of a nonincreasing nonnegative function, it is a concave function such that the map t ↦ J(t)/t, which is the average of its derivative over [0, t], is nonincreasing. The concavity shows that its perspective $(x, t) \mapsto x\,J(t/x)$ is a concave function of its two arguments (cf. [2], page 89). Furthermore, the “extended-value extension” of this function (which by definition is −∞ if x ≤ 0 or t ≤ 0) is obviously nondecreasing in its second argument, as J is nondecreasing, and is nondecreasing in its first argument, because t ↦ J(t)/t was noted to be nonincreasing. Therefore, by the vector composition rules for concave functions ([2], pages 83–87, especially lines -2 and -1 of page 86), the function

$$H(x, t) = \sqrt{x}\;J\Big(\sqrt{t/x}\Big)$$

is concave. We have that $J(\theta_n, \mathcal{F}, L_2)\,\sqrt{\mathbb{P}_n F^2} = H\big(\mathbb{P}_n F^2,\, \sup_{f\in\mathcal{F}} \mathbb{P}_n f^2\big)$. Therefore, by an application of Jensen’s inequality to the right side of the preceding display we obtain, for $\sigma_n^2 := E^*\sup_{f\in\mathcal{F}} \mathbb{P}_n f^2$,
$$E_P^*\|\mathbb{G}_n\|_{\mathcal{F}} \lesssim J\Big(\frac{\sigma_n}{\|F\|_{P,2}}, \mathcal{F}, L_2\Big)\,\|F\|_{P,2}. \qquad (2.2)$$
The application of Jensen’s inequality with outer expectations can be justified here by the monotonicity of the function H, which shows that the measurable majorant of a variable H(U, V) is bounded above by H(U*, V*), for U* and V* measurable majorants of U and V. Thus E*H(U, V) ≤ EH(U*, V*), after which Jensen’s inequality can be applied in its usual (measurable) form.
The second step of the proof is to bound $\sigma_n^2 = E^*\sup_{f\in\mathcal{F}} \mathbb{P}_n f^2$. Because $\mathbb{P}_n f^2 = Pf^2 + n^{-1/2}\,\mathbb{G}_n f^2$ and Pf² ≤ δ²PF² for every f, we have
$$E^*\sup_{f\in\mathcal{F}} \mathbb{P}_n f^2 \le \delta^2\,PF^2 + \frac{1}{\sqrt{n}}\,E^*\|\mathbb{G}_n\|_{\mathcal{F}^2}. \qquad (2.3)$$
Here the empirical process in the second term can be replaced by the symmetrized empirical process (defined as $\mathbb{G}_n^\circ f = n^{-1/2}\sum_{i=1}^n \varepsilon_i f(X_i)$ for independent Rademacher variables ε1, ε2, … , εn) at the cost of adding a multiplicative factor 2 (e.g. [12], 2.3.1). The expectation can be factorized as the expectation on the Rademacher variables ε followed by the expectation on X1, …, Xn. By the contraction principle for Rademacher variables ([6], Theorem 4.12), and the fact that F ≤ 1 by assumption, so that t ↦ t² is Lipschitz with constant 2 on the relevant range, the inner expectation satisfies $E_\varepsilon\|\mathbb{G}_n^\circ\|_{\mathcal{F}^2} \lesssim E_\varepsilon\|\mathbb{G}_n^\circ\|_{\mathcal{F}}$. Taking the expectation on X1, … , Xn, we obtain that $E^*\|\mathbb{G}_n^\circ\|_{\mathcal{F}^2} \lesssim E^*\|\mathbb{G}_n^\circ\|_{\mathcal{F}}$, which in turn is bounded above by a multiple of $E^*\|\mathbb{G}_n\|_{\mathcal{F}}$ by the desymmetrization inequality (e.g. 2.3.6 in [12]).
Thus $E^*\|\mathbb{G}_n\|_{\mathcal{F}^2}$ in the last term of (2.3) can be replaced by $E^*\|\mathbb{G}_n\|_{\mathcal{F}}$, at the cost of inserting a constant. Next we apply (2.2) to this term, and conclude that $z_n := \sigma_n/\|F\|_{P,2}$ satisfies the inequality
$$z_n^2 \le \delta^2 + \frac{C}{\sqrt{n}\,\|F\|_{P,2}}\,J(z_n, \mathcal{F}, L_2), \qquad (2.4)$$

for a universal constant C.
We apply Lemma 2.1 with r = 1, A = δ and $B^2 = C/(\sqrt{n}\,\|F\|_{P,2})$ to see that

$$J(z_n, \mathcal{F}, L_2) \lesssim J(\delta, \mathcal{F}, L_2)\Big(1 + \frac{J(\delta, \mathcal{F}, L_2)}{\delta^2\sqrt{n}\,\|F\|_{P,2}}\Big).$$
We insert this in (2.2) to complete the proof.
Lemma 2.1. Let $J\colon [0, \infty) \to [0, \infty)$ be a concave, nondecreasing function with J(0) = 0. If $z^2 \le A^2 + B^2 J(z^r)$ for some r ∈ (0, 2) and A, B > 0, then

$$J(z^r) \lesssim J(A^r)\Big[1 + J(A^r)^{r/(2-r)}\Big(\frac{B}{A}\Big)^{2r/(2-r)}\Big], \qquad z^2 \lesssim A^2 + \Big(\frac{B^2\,J(A^r)}{A^r}\Big)^{2/(2-r)}.$$
Proof. For t > s > 0 we can write s as the convex combination s = (s/t)t+ (1−s/t)0 of t and 0. Since J(0) = 0, the concavity of J gives that J(s) ≥ (s/t)J(t). Thus the function t ↦ J(t)/t is decreasing, which implies that J(Ct) ≤ CJ(t) for C ≥ 1 and any t > 0.
By the monotonicity of J and the assumption on z it follows that

$$J(z^r) \le J\big((A^2 + B^2 J(z^r))^{r/2}\big) \le J\Big(A^r\Big(1 + \frac{B^2}{A^2}J(z^r)\Big)^{r/2}\Big) \lesssim J(A^r)\Big(1 + \Big(\frac{B}{A}\Big)^r J(z^r)^{r/2}\Big).$$
This implies that J(z^r) is bounded by a multiple of the maximum of $J(A^r)$ and $J(A^r)(B/A)^r J(z^r)^{r/2}$. If it is bounded by the second one, then $J(z^r)^{1-r/2} \lesssim J(A^r)(B/A)^r$, so that $J(z^r) \lesssim J(A^r)^{2/(2-r)}(B/A)^{2r/(2-r)}$. We conclude that

$$J(z^r) \lesssim J(A^r) + J(A^r)^{2/(2-r)}\Big(\frac{B}{A}\Big)^{2r/(2-r)},$$

which is the first assertion.
Next again by the monotonicity of J,

$$z^2 \le A^2 + B^2 J(z^r) \lesssim A^2 + B^2 J(A^r) + B^2 J(A^r)^{2/(2-r)}\Big(\frac{B}{A}\Big)^{2r/(2-r)}.$$
The middle term on the right side is bounded by a multiple of the sum of the first and third terms, since $xy \le x^p/p + y^q/q$ for any conjugate pair (p, q) and any x, y > 0; in particular, with p = 2/r and q = 2/(2 − r),

$$B^2 J(A^r) \le \frac{r}{2}\,A^2 + \frac{2-r}{2}\Big(\frac{B^2 J(A^r)}{A^r}\Big)^{2/(2-r)}.$$
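For the case r = 1, which is the case used in the proof of Theorem 2.1, the conclusion of the lemma specializes to

$$J(z) \lesssim J(A)\Big(1 + J(A)\,\frac{B^2}{A^2}\Big),$$

which with A = δ and $B^2 = C/(\sqrt{n}\,\|F\|_{P,2})$ gives exactly the factor appearing in the statement of Theorem 2.1.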
For values of δ such that $\delta^2\sqrt{n}\,\|F\|_{P,2} \le J(\delta, \mathcal{F}, L_2)$ Theorem 2.1 can be improved. (This seems not to be of prime interest for statistical applications.) Its bound can be written in the form $J(\delta)\,\|F\|_{P,2} + J^2(\delta)/(\delta^2\sqrt{n})$. In the second term δ can be replaced by the solution $\bar\delta_n$ of the equation $\bar\delta_n^2\sqrt{n}\,\|F\|_{P,2} = J(\bar\delta_n)$, which is better if δ is smaller than the latter number, as the function $\delta \mapsto J(\delta)/\delta$ is decreasing.
Lemma 2.2. Under the conditions of Theorem 2.1, with $\bar\delta_n$ the solution of $\bar\delta_n^2\sqrt{n}\,\|F\|_{P,2} = J(\bar\delta_n, \mathcal{F}, L_2)$,

$$E_P^*\|\mathbb{G}_n\|_{\mathcal{F}} \lesssim J(\delta, \mathcal{F}, L_2)\,\|F\|_{P,2} + \frac{J^2\big(\delta \vee \bar\delta_n, \mathcal{F}, L_2\big)}{(\delta \vee \bar\delta_n)^2\,\sqrt{n}}.$$
Proof. We follow the proof of Theorem 2.1 up to (2.4), but next use the alternative bounds

$$J(z_n) \le J\big(\delta + B\sqrt{J(z_n)}\big) \le J(\delta) + J(\bar\delta_n) + \frac{B\sqrt{J(z_n)}}{\bar\delta_n}\,J(\bar\delta_n),$$

for $B^2 = C/(\sqrt{n}\,\|F\|_{P,2})$ as in that proof. Here we have used the subadditivity of the maps $z \mapsto \sqrt{z}$ and J, and the inequality J(Cz) ≤ CJ(z) for C ≥ 1 in the last step. We can bound the sum of the three terms on the right side by a multiple of the maximum of these terms and conclude that the left side is smaller than a multiple of at least one of the three terms. Solving next yields that

$$J(z_n) \lesssim J(\delta) + J(\bar\delta_n) + \frac{B^2 J^2(\bar\delta_n)}{\bar\delta_n^2}.$$
Because $J(\delta_n) \ge \delta_n$ for every δn > 0, by the definition of the entropy integral (the integrand being at least 1), and by the definition of $\bar\delta_n$, the third term on the right is bounded by a multiple of the second term. We substitute the bound in (2.2) to finish the proof.
3. Unbounded classes
In this section we investigate relaxations of the assumption that the class of functions is uniformly bounded, made in Theorem 2.1. We start with a moment bound on the envelope.
Theorem 3.1. Let $\mathcal{F}$ be a P-measurable class of measurable functions with envelope function F such that $PF^{(4p-2)/(p-1)} < \infty$ for some p > 1 and such that the classes $\mathcal{F}^2$ and $\mathcal{F}^4$ are P-measurable. If Pf² < δ²PF² for every f and some δ ∈ (0, 1), then, for a constant $c_p(F)$ depending on p and the moments of F only,

$$E_P^*\|\mathbb{G}_n\|_{\mathcal{F}} \le c_p(F)\,J\big(\delta^{1/p}, \mathcal{F}, L_2\big)\bigg(1 + \Big(\frac{J(\delta^{1/p}, \mathcal{F}, L_2)}{\delta^2\sqrt{n}\,\|F\|_{P,2}}\Big)^{1/(2p-1)}\bigg)\,\|F\|_{P,2}.$$
Proof. Application of (2.1) to the functions f², forming the class $\mathcal{F}^2$ with envelope function F², yields
$$E_P^*\|\mathbb{G}_n\|_{\mathcal{F}^2} \lesssim E^*\bigg[J\Big(\frac{\sigma_{n,4}^2}{\|F\|_{\mathbb{P}_n,4}^2}, \mathcal{F}^2, L_2\Big)\,\sqrt{\mathbb{P}_n F^4}\bigg], \qquad (3.1)$$
for $\sigma_{n,r}$ the diameter of $\mathcal{F}$ in $L_r(\mathbb{P}_n)$, i.e.
$$\sigma_{n,r} = \sup_{f\in\mathcal{F}}\big(\mathbb{P}_n|f|^r\big)^{1/r}. \qquad (3.2)$$
Preservation properties of uniform entropy (see [10], or [12], 2.10.20, where the supremum over Q can also be moved outside the integral to match our current definition of entropy integral, applied to ϕ(f) = f² with L = 2F) show that $J(\delta, \mathcal{F}^2, L_2) \lesssim J(\delta, \mathcal{F}, L_2)$, for every δ > 0. Because $\mathbb{P}_n f^2 = Pf^2 + n^{-1/2}\,\mathbb{G}_n f^2$ and Pf² ≤ δ²PF² by assumption, we find that
$$E^*\sigma_{n,2}^2 \le \delta^2\,PF^2 + \frac{1}{\sqrt{n}}\,E^*\bigg[J\Big(\frac{\sigma_{n,4}^2}{\|F\|_{\mathbb{P}_n,4}^2}, \mathcal{F}, L_2\Big)\,\sqrt{\mathbb{P}_n F^4}\bigg]. \qquad (3.3)$$
The next step is to bound $\sigma_{n,4}$ in terms of $\sigma_{n,2}$.
By Hölder’s inequality, for any conjugate pair (p, q) and any 0 < s < 4,

$$\mathbb{P}_n|f|^4 = \mathbb{P}_n|f|^{4-s}|f|^{s} \le \big(\mathbb{P}_n|f|^{(4-s)p}\big)^{1/p}\,\big(\mathbb{P}_n|f|^{sq}\big)^{1/q}.$$
Choosing s such that (4 − s)p = 2, so that s = 4 − 2/p and hence sq = (4p − 2)/(p − 1) for q = p/(p − 1), and using that |f| ≤ F, we find that

$$\sigma_{n,4}^4 \le \sigma_{n,2}^{2/p}\,\big(\mathbb{P}_n F^{(4p-2)/(p-1)}\big)^{1/q}.$$
We insert this bound in (3.3). The function $(x, y) \mapsto x^{1/p}y^{1/q}$ is concave, and hence the function $(x, y, t) \mapsto J\big(x^{1/(2p)}y^{1/(2q)}/t^{1/2}\big)\,t^{1/2}$ can be seen to be concave by the same arguments as in the proof of Theorem 2.1. Therefore, we can apply Jensen’s inequality to see that

$$E^*\bigg[J\Big(\frac{\sigma_{n,4}^2}{\|F\|_{\mathbb{P}_n,4}^2}, \mathcal{F}, L_2\Big)\,\sqrt{\mathbb{P}_n F^4}\bigg] \le J\bigg(\frac{(E^*\sigma_{n,2}^2)^{1/(2p)}\,\big(PF^{(4p-2)/(p-1)}\big)^{1/(2q)}}{(PF^4)^{1/2}}, \mathcal{F}, L_2\bigg)\,\sqrt{PF^4}.$$
We conclude that $z_n = \sqrt{E^*\sigma_{n,2}^2}/\|F\|_{P,2}$ satisfies

$$z_n^2 \le \delta^2 + \frac{c_p(F)}{\sqrt{n}\,\|F\|_{P,2}}\,J\big(z_n^{1/p}, \mathcal{F}, L_2\big),$$

for a constant $c_p(F)$ depending on p and the moments of F.
In the last step we use that J(Ct) ≤ CJ(t) for C ≥ 1, and Hölder’s inequality as previously to see that the present C satisfies this condition. We next apply Lemma 2.1 (with r = 1/p) to obtain a bound on $J(z_n^{1/p}, \mathcal{F}, L_2)$, and conclude the proof by substituting this bound in (2.2), where we note that $z_n \le 1$, so that $J(z_n) \le J(z_n^{1/p})$.
The preceding theorem assumes only a finite moment of the envelope function, but in comparison to Theorem 2.1 substitutes $\delta^{1/p}$ for δ in the correction term of the upper bound, where p > 1 and hence δ^{1/p} ≫ δ for small δ. In applications to moduli of continuity of minimum contrast criteria this is sufficient to obtain consistency with a rate, but typically the rate will be suboptimal. The rate improves as p ↓ 1, which requires finite moments of the envelope function of order increasing to infinity, the limiting case p = 1 corresponding to a bounded envelope, as in Theorem 2.1. The following theorem interpolates between finite moments of any order and a bounded envelope function. If applied to obtaining rates of convergence it gives rates that are optimal up to a logarithmic factor.
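As an indication of the trade-off between moments and rate (illustrative arithmetic only): for p = 2 the moment condition requires $PF^6 < \infty$ and the correction is evaluated at δ^{1/2}, while evaluation at δ^{3/4} already requires p = 4/3 and hence $PF^{10} < \infty$, since

$$\frac{4p-2}{p-1}\bigg|_{p=2} = 6, \qquad \frac{4p-2}{p-1}\bigg|_{p=4/3} = 10.$$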
Theorem 3.2. Let $\mathcal{F}$ be a P-measurable class of measurable functions with envelope function F such that $P\exp(F^{p+\rho}) < \infty$ for some p, ρ > 0 and such that the classes $\mathcal{F}^2$ and $\mathcal{F}^4$ are P-measurable. If Pf² < δ²PF² for every f and some δ ∈ (0, 1/2), then for a constant c dependent on p, PF², PF⁴ and $P\exp(F^{p+\rho})$,

$$E_P^*\|\mathbb{G}_n\|_{\mathcal{F}} \le c\,J\big(\delta\log^{1/p}(1/\delta), \mathcal{F}, L_2\big)\bigg(1 + \frac{J\big(\delta\log^{1/p}(1/\delta), \mathcal{F}, L_2\big)}{\delta^2\sqrt{n}\,\|F\|_{P,2}}\bigg)\,\|F\|_{P,2}.$$
Proof. Fix r = 2/p. The functions ψ and ϕ, defined by

$$\psi(x) = \log^r(1+x), \qquad \phi(y) = e^{y^{1/r}} - 1,$$

are each other’s inverses, and are increasing from 0 to infinity. Thus their primitive functions $\Psi(f) = \int_0^f \psi(x)\,dx$ and $\Phi(g) = \int_0^g \phi(y)\,dy$ satisfy Young’s inequality $fg \le \Psi(f) + \Phi(g)$, for every f, g ≥ 0 (e.g. [2], page 120, 3.38).
The function $t \mapsto t\log^r(1/t)$ is concave in a neighbourhood of 0 (specifically: on the interval (0, e^{1−r} ∧ 1); see the computation following (3.4)), with limit from the right equal to 0 at 0, and derivative tending to infinity at this point. Therefore, there exists a concave, increasing function k: (0, ∞) → (0, ∞) that is identical to $t \mapsto t\log^r(1/t)$ near 0 and bounded below and above by a positive constant times the identity throughout its domain. (E.g. extend $t \mapsto t\log^r(1/t)$ linearly with slope 1 from the point where the derivative of the latter function has decreased to 1.) Write $k(t) = t\,l^r(t)$, so that $l^r$ is bounded below by a constant and l(t) = log(1/t) near 0. Then, for every t > 0,
(3.4) |
(The constant in ≲ may depend on r.) To see this, note that for C > c the left side is bounded by a multiple of log(2 + t/c), whereas for small C the left side is bounded by a multiple of .
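The interval of concavity mentioned above can be verified directly: with $h(t) = t\log^r(1/t)$,

$$h''(t) = -\frac{r}{t}\,\Big(\log\frac{1}{t}\Big)^{r-2}\Big(\log\frac{1}{t} - (r-1)\Big),$$

which is nonpositive exactly when $t \le e^{1-r}$ (and automatically for all t ∈ (0, 1) when r ≤ 1).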
From the inequality Ψ(f) ≤ fψ(f), we obtain that, for f > 0,
Therefore, by (3.4) followed by Young’s inequality,
On integrating this with respect to the empirical measure, with $G = \Phi(F^2)$, we see that $\mathbb{P}_n f^4$ is bounded by a multiple of $k(\mathbb{P}_n f^2)\,(1 + \mathbb{P}_n G)$.
We take the supremum over f to bound $\sigma_{n,4}^4$ as in (3.2) in terms of $k(\sigma_{n,2}^2)$, and next substitute this bound in (3.3) to find that
where we have used the concavity of k, and the concavity of the other maps, as previously. By assumption the expected value PG is finite for r = 2/p. It follows that $z_n = \sqrt{E^*\sigma_{n,2}^2}/\|F\|_{P,2}$ satisfies, for suitable constants a, b, c depending on r, PF², PF⁴ and PG,

$$z_n^2 \le a\,\delta^2 + \frac{b}{\sqrt{n}\,\|F\|_{P,2}}\,J\big(c\,\sqrt{k(z_n^2)},\, \mathcal{F}, L_2\big).$$
By concavity and the fact that k(0) = 0, we have k(Cz) ≤ Ck(z), for C ≥ 1 and z > 0. The function $z \mapsto \sqrt{k(z^2)}$ inherits this property. Therefore we can apply Lemma 3.1, with k of the lemma equal to the present function $z \mapsto \sqrt{k(z^2)}$, to obtain a bound on $J(\sqrt{k(z_n^2)}, \mathcal{F}, L_2)$ in terms of $J(\sqrt{k(\delta^2)}, \mathcal{F}, L_2)$, which we substitute in (2.2). Here k(δ²) = δ² log^r(1/δ) for sufficiently small δ > 0, and k(δ²) ≍ δ² for δ < 1/2 and bounded away from 0. Thus we can simplify the bound to the one in the statement of the theorem, possibly after increasing the constants a, b, c to be at least 1, to complete the proof.
Lemma 3.1. Let $J\colon [0, \infty) \to [0, \infty)$ be a concave, nondecreasing function with J(0) = 0, and let k: (0, ∞) → (0, ∞) be nondecreasing and satisfy k(Cz) ≤ Ck(z) for C ≥ 1 and z > 0. If $z^2 \le A^2 + B^2 J(k(z))$ for some A, B > 0, then

$$J(k(z)) \lesssim J(k(A))\Big(1 + J(k(A))\,\frac{B^2}{A^2}\Big), \qquad z^2 \lesssim A^2 + \frac{B^4}{A^2}\,J(k(A))^2.$$
Proof. As noted in the proof of Lemma 2.1 the properties of J imply that J(Cz) ≤ CJ(z) for C ≥ 1 and any z > 0. In view of the assumed property of k and the monotonicity of J it follows that $J(k(Cz)) \le C\,J(k(z))$ for every C ≥ 1 and z > 0. Therefore, by the monotonicity of J and k, and the assumption on z,

$$J(k(z)) \le J\Big(k\big(A + B\sqrt{J(k(z))}\big)\Big) \le \Big(1 + \frac{B}{A}\sqrt{J(k(z))}\Big)\,J(k(A)).$$
As in the proof of Lemma 2.1 we can solve this for J ∘ k(z) to find that

$$J(k(z)) \lesssim J(k(A))\Big(1 + J(k(A))\,\frac{B^2}{A^2}\Big).$$
Next again by the monotonicity of J,

$$z^2 \le A^2 + B^2 J(k(z)) \lesssim A^2 + B^2 J(k(A)) + \frac{B^4}{A^2}\,J(k(A))^2.$$
The middle term on the right side is bounded by the sum of the first and third terms, since $2xy \le x^2 + y^2$ with x = A and $y = B^2 J(k(A))/A$.
Footnotes
AMS 2000 subject classifications: 60K35.
Contributor Information
Aad van der Vaart, Department of Mathematics, Faculty of Sciences, Vrije Universiteit, De Boelelaan 1081a, 1081 HV Amsterdam, aad@cs.vu.nl.
Jon A. Wellner, Department of Statistics, University of Washington, Seattle, WA 98195-4322, jaw@stat.washington.edu.
References
- [1].Birgé L, Massart P. Rates of convergence for minimum contrast estimators. Probab. Theory Related Fields. 1993;97(1-2):113–150. MR1240719.
- [2].Boyd S, Vandenberghe L. Convex Optimization. Cambridge University Press; Cambridge: 2004. MR2061575.
- [3].Dudley RM. Central limit theorems for empirical measures. Ann. Probab. 1978;6(6):899–929. MR0512411.
- [4].Giné E, Koltchinskii V. Concentration inequalities and asymptotic results for ratio type empirical processes. Ann. Probab. 2006;34(3):1143–1216. MR2243881.
- [5].Kolchins’kiĭ VĪ. On the central limit theorem for empirical measures. Teor. Veroyatnost. i Mat. Statist. 1981;24:63–75, 152. MR0628431.
- [6].Ledoux M, Talagrand M. Probability in Banach Spaces: Isoperimetry and Processes. Vol. 23 of Ergebnisse der Mathematik und ihrer Grenzgebiete (3). Springer-Verlag; Berlin: 1991. MR1102015.
- [7].Massart P, Nédélec É. Risk bounds for statistical learning. Ann. Statist. 2006;34(5):2326–2366. MR2291502.
- [8].Ossiander M. A central limit theorem under metric entropy with L2 bracketing. Ann. Probab. 1987;15(3):897–919. MR0893905.
- [9].Pollard D. A central limit theorem for empirical processes. J. Austral. Math. Soc. Ser. A. 1982;33(2):235–248. MR0668445.
- [10].Pollard D. Empirical Processes: Theory and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics, 2. Institute of Mathematical Statistics; Hayward, CA: 1990. MR1089429.
- [11].van de Geer S. The method of sieves and minimum contrast estimators. Math. Methods Statist. 1995;4(1):20–38. MR1324688.
- [12].van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. With Applications to Statistics. Springer Series in Statistics. Springer-Verlag; New York: 1996. MR1385671.