Estimating beta-mixing coefficients

Daniel J McDonald; Cosma Rohilla Shalizi; Mark Schervish

Author manuscript; available in PMC: 2015 Aug 14.

Published in final edited form as: JMLR Workshop Conf Proc. 2011;15:516–524.

PMCID: PMC4537076 NIHMSID: NIHMS697610 PMID: 26279742

Abstract

The literature on statistical learning for time series assumes the asymptotic independence or “mixing” of the data-generating process. These mixing assumptions are never tested, and there are no methods for estimating mixing rates from data. We give an estimator for the beta-mixing rate based on a single stationary sample path and show it is L1-risk consistent.

1 Introduction

Relaxing the assumption of independence is an active area of research in the statistics and machine learning literature. For time series, independence is replaced by the asymptotic independence of events far apart in time, or “mixing”. Mixing conditions make the dependence of the future on the past explicit by quantifying the decay in dependence as the future moves farther from the past. There are many definitions of mixing of varying strength with matching dependence coefficients (see [9, 7, 4] for reviews), but most of the results in the learning literature focus on β-mixing or absolute regularity. Roughly speaking (see Definition 2.1 below for a precise statement), the β-mixing coefficient at lag a is the total variation distance between the actual joint distribution of events separated by a time steps and the product of their marginal distributions, i.e., the L¹ distance from independence.

Numerous results in the statistical machine learning literature rely on knowledge of the β-mixing coefficient. As Vidyasagar [25, p. 41] notes, β-mixing is “just right” for the extension of IID results to dependent data, and so recent work has consistently focused on it. Meir [15] derives generalization error bounds for nonparametric methods based on model selection via structural risk minimization. Baraud et al. [1] study the finite sample risk performance of penalized least squares regression estimators under β-mixing. Lozano et al. [13] examine regularized boosting algorithms under absolute regularity and prove consistency. Karandikar and Vidyasagar [12] consider “probably approximately correct” learning algorithms, proving that PAC algorithms for IID inputs remain PAC with β-mixing inputs under some mild conditions. Ralaivola et al. [20] derive PAC bounds for ranking statistics and classifiers using a decomposition of the dependency graph. Finally, Mohri and Rostamizadeh [16] derive stability bounds for β-mixing inputs, generalizing existing stability results for IID data.

All these results assume not just β-mixing, but known mixing coefficients. In particular, the risk bounds in [15, 16] and [20] are incalculable without knowledge of the rates. This knowledge is never available. Unless researchers are willing to assume specific values for a sequence of β-mixing coefficients, the results mentioned in the previous paragraph are generally useless when confronted with data. To illustrate this deficiency, consider Theorem 18 of [16]:

Theorem 1.1

(Briefly). Assume a learning algorithm is λ-stable.¹ Then, for any sample of size n drawn from a stationary β-mixing distribution, and any ε > 0,

ℙ (| R - \hat{R} | > ε) \leq Γ (n, λ, ε, a, b) + β (a) (μ_{n} - 1)

where n = (a + b)μ_n, Γ has a particular functional form, and R − R̂ is the difference between the true risk and the empirical risk.

Ideally, one could use this result for model selection or to control the size of the generalization error of competing prediction algorithms (support vector machines, support vector regression, and kernel ridge regression are a few of the many algorithms known to satisfy λ-stability). However, the bound depends explicitly on the mixing coefficient β(a). To make matters worse, there are no methods for estimating the β-mixing coefficients. According to Meir [15, p. 7], “there is no efficient practical approach known at this stage for estimation of mixing parameters.” We begin to rectify this problem by deriving the first method for estimating these coefficients. We prove that our estimator is consistent for arbitrary β-mixing processes. In addition, we derive rates of convergence for Markov approximations to these processes.

Application of statistical learning results to β-mixing data is highly desirable in applied work. Many common time series models are known to be β-mixing, and the rates of decay are known given the true parameters of the process. Among the processes for which such knowledge is available are ARMA models [17], GARCH models [5], and certain Markov processes — see [9] for an overview of such results. To our knowledge, only Nobel [18] approaches a solution to the problem of estimating mixing rates by giving a method to distinguish between different polynomial mixing rate regimes through hypothesis testing.

We present the first method for estimating the β-mixing coefficients for stationary time series data. Section 2 defines the β-mixing coefficient and states our main results on convergence rates and consistency for our estimator. Section 3 gives an intermediate result on the L¹ convergence of the histogram estimator with β-mixing inputs. Section 4 proves the main results from §2. Section 5 concludes and lays out some avenues for future research.

2 Estimation of β-mixing

In this section, we present one of many equivalent definitions of absolute regularity and state our main results, deferring proof to §4.

To fix notation, let $X = \{X_{t}\}_{t = - \infty}^{\infty}$ be a sequence of random variables where each X_t is a measurable function from a probability space (Ω, ℱ, ℙ) into a measurable space 𝒳. A block of this random sequence will be given by $X_{i}^{j} \equiv \{X_{t}\}_{t = i}^{j}$ where i and j are integers, and may be infinite. We use similar notation for the sigma fields generated by these blocks and their joint distributions. In particular, $σ_{i}^{j}$ will denote the sigma field generated by $X_{i}^{j}$ , and the joint distribution of $X_{i}^{j}$ will be denoted $ℙ_{i}^{j}$ .

2.1 Definitions

There are many equivalent definitions of β-mixing (see for instance [9], or [4] as well as Meir [15] or Yu [28]); the most intuitive is that given in Doukhan [9].

Definition 2.1

(β-mixing). For each positive integer a, the coefficient of absolute regularity, or β-mixing coefficient, β(a), is

β (a) \equiv sup_{t} {‖ ℙ_{- \infty}^{t} \otimes ℙ_{t + a}^{\infty} - ℙ_{t, a} ‖}_{T V}

(1)

where ‖·‖_TV is the total variation norm, and ℙ_t,a is the joint distribution of ( $X_{- \infty}^{t}, X_{t + a}^{\infty}$ ). A stochastic process is said to be absolutely regular, or β-mixing, if β(a) → 0 as a → ∞.

Loosely speaking, Definition 2.1 says that the coefficient β(a) measures the total variation distance between the joint distribution of random variables separated by a time units and a distribution under which random variables separated by a time units are independent. The supremum over t is unnecessary for stationary random processes X, which is the only case we consider here.

Definition 2.2

(Stationarity). A sequence of random variables X is stationary when all its finite-dimensional distributions are invariant over time: for all t and all non-negative integers i and j, the random vectors $X_{t}^{t + i}$ and $X_{t + j}^{t + i + j}$ have the same distribution.

Our main result requires the method of blocking used by Yu [27, 28]. The purpose is to transform a sequence of dependent variables into a subsequence of nearly IID ones. Consider a sample $X_{1}^{n}$ from a stationary β-mixing sequence with density f. Let m_n and μ_n be positive integers such that 2m_nμ_n = n. Now divide $X_{1}^{n}$ into 2μ_n blocks, each of length m_n. Identify the blocks as follows:

U_{j} = {X_{i} : 2 (j - 1) m_{n} + 1 \leq i \leq (2 j - 1) m_{n}},

V_{j} = {X_{i} : (2 j - 1) m_{n} + 1 \leq i \leq 2 j m_{n}} .

Let U be the entire sequence of odd blocks U_j, and let V be the sequence of even blocks V_j. Finally, let U′ be a sequence of blocks which are independent of $X_{1}^{n}$ but such that each block has the same distribution as a block from the original sequence:

U_{j}^{'} \overset{D}{=} U_{j} \overset{D}{=} U_{1} .

(2)

The blocks U′ are now an IID block sequence, so standard results apply. (See [28] for a more rigorous analysis of blocking.) With this structure, we can state our main result.
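As a concrete illustration of the index bookkeeping above, the following minimal Python sketch splits a sample into the odd blocks U_j and the even blocks V_j. The function name `split_blocks` is hypothetical, and dropping any trailing observations when n is not an exact multiple of 2m_n is an assumption of the sketch rather than something the text specifies.

```python
def split_blocks(x, m):
    """Split a sample x_1..x_n into alternating blocks of length m.

    Returns (U, V): U[j-1] holds the odd block U_j and V[j-1] the even
    block V_j. Observations beyond 2*m*mu are dropped (an assumption of
    this sketch).
    """
    if m < 1:
        raise ValueError("block length m must be a positive integer")
    mu = len(x) // (2 * m)  # number of (U_j, V_j) pairs
    if mu < 1:
        raise ValueError("sample is shorter than one pair of blocks")
    # The 1-indexed range 2(j-1)m+1 .. (2j-1)m is the 0-indexed slice below.
    U = [list(x[2 * (j - 1) * m:(2 * j - 1) * m]) for j in range(1, mu + 1)]
    V = [list(x[(2 * j - 1) * m:2 * j * m]) for j in range(1, mu + 1)]
    return U, V


U, V = split_blocks(list(range(1, 13)), m=2)
print(U)  # odd blocks U_1, U_2, U_3
print(V)  # even blocks V_1, V_2, V_3
```

Consecutive odd blocks are separated by a gap of m observations, which is what lets the dependence between them be controlled by β(m_n).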

2.2 Results

Our main result emerges in two stages. First, we recognize that the distribution of a finite sample depends only on finite-dimensional distributions. This leads to an estimator of a finite-dimensional version of β(a). Next, we let the finite-dimension increase to infinity with the size of the observed sample.

For positive integers t, d, and a, define

β^{d} (a) \equiv {‖ ℙ_{t - d + 1}^{t} \otimes ℙ_{t + a}^{t + a + d - 1} - ℙ_{t, a, d} ‖}_{T V},

(3)

where ℙ_t,a,d is the joint distribution of ( $X_{t - d + 1}^{t}, X_{t + a}^{t + a + d - 1}$ ). Also, let f̂^d be the d-dimensional histogram estimator of the joint density of d consecutive observations, and let ${\hat{f}}_{a}^{2 d}$ be the 2d-dimensional histogram estimator of the joint density of two sets of d consecutive observations separated by a time points.

We construct an estimator of β^d(a) based on these two histograms.² Define

{\hat{β}}^{d} (a) \equiv \frac{1}{2} \int | {\hat{f}}_{a}^{2 d} - {\hat{f}}^{d} \otimes {\hat{f}}^{d} |

(4)
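Because histogram densities are constant on each cell, the integral in (4) reduces to a finite sum over cells of the absolute difference between the joint cell frequency and the product of the marginal cell frequencies. The sketch below implements that sum in plain Python. The function names, the equal-width grid anchored at zero, and the choice to use all overlapping windows of the sample for both histograms are assumptions made for illustration; the text does not prescribe them.

```python
import math
from collections import Counter
from itertools import product


def bin_block(block, h):
    """Map a block of real values to its histogram cell (grid width h)."""
    return tuple(math.floor(v / h) for v in block)


def beta_hat(x, a, d=1, h=1.0):
    """Plug-in estimate: half the L1 distance between joint and product histograms."""
    n = len(x)
    n_marg = n - d + 1           # number of d-blocks
    n_pairs = n - 2 * d - a + 2  # number of (past block, future block) pairs
    if n_pairs < 1:
        raise ValueError("sample too short for this choice of a and d")
    marg = Counter(bin_block(x[s:s + d], h) for s in range(n_marg))
    joint = Counter(
        (bin_block(x[s:s + d], h),
         bin_block(x[s + d - 1 + a:s + 2 * d - 1 + a], h))
        for s in range(n_pairs)
    )
    total = 0.0
    # Every joint cell lies in the product of the marginal supports.
    for b1, b2 in product(marg, repeat=2):
        p_joint = joint.get((b1, b2), 0) / n_pairs
        p_prod = (marg[b1] / n_marg) * (marg[b2] / n_marg)
        total += abs(p_joint - p_prod)
    return 0.5 * total


alternating = [0.0, 1.0] * 4  # perfectly dependent toy series
constant = [0.3] * 10         # degenerate series, a single occupied cell
print(beta_hat(alternating, a=1, d=1, h=0.5))
print(beta_hat(constant, a=1, d=1, h=0.5))
```

The loop visits every pair of occupied marginal cells, so its cost grows with the square of the number of occupied cells; this is acceptable for a sketch but would need care for large d.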

We show that, by allowing d = d_n to grow with n, this estimator will converge on β(a). This can be seen most clearly by bounding the ℓ¹-risk of the estimator with its estimation and approximation errors:

| {\hat{β}}^{d} (a) - β (a) | \leq | {\hat{β}}^{d} (a) - β^{d} (a) | + | β^{d} (a) - β (a) | .

The first term is the error of estimating β^d(a) with a random sample of data. The second term is the non-stochastic error induced by approximating the infinite dimensional coefficient, β(a), with its d-dimensional counterpart, β^d(a).

Our first theorem in this section establishes consistency of β̂^d_n (a) as an estimator of β(a) for all β-mixing processes provided d_n increases at an appropriate rate. Theorem 2.4 gives finite sample bounds on the estimation error while some measure theoretic arguments contained in §4 show that the approximation error must go to zero as d_n → ∞.

Theorem 2.3

Let $X_{1}^{n}$ be a sample from an arbitrary β-mixing process. Let d_n = O(exp{W(log n)}) where W is the Lambert W function.³ Then ${\hat{β}}^{d_{n}} (a) \overset{P}{\to} β (a)$ as n → ∞.

A finite sample bound for the estimation error is the first step to establishing consistency for β̂^d(a). This result gives convergence rates for estimation of the finite dimensional mixing coefficient β^d(a) and also for Markov processes of known order d, since in this case, β^d(a) = β(a).

Theorem 2.4

Consider a sample $X_{1}^{n}$ from a stationary β-mixing process. Let μ_n and m_n be positive integers such that 2μ_nm_n = n and μ_n ≥ d > 0. Then

ℙ (| {\hat{β}}^{d} (a) - β^{d} (a) | > ε) \leq 2 exp {- \frac{μ_{n} ε_{1}^{2}}{2}} + 2 exp {- \frac{μ_{n} ε_{2}^{2}}{2}} + 4 (μ_{n} - 1) β (m_{n}),

where ε₁ = ε/2 − 𝔼 [∫|f̂^d − f^d|] and $ε_{2} = ε - 𝔼 [\int | {\hat{f}}_{a}^{2 d} - f_{a}^{2 d} |]$ .
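To see how the three terms of this bound trade off, the following sketch simply evaluates the right-hand side for given inputs. The function name is hypothetical, and ε₁, ε₂ and β(m_n) are passed in as numbers, since in practice the expectations inside ε₁ and ε₂ and the true mixing coefficient are unknown.

```python
import math


def theorem_2_4_bound(mu, eps1, eps2, beta_m):
    """Right-hand side of the Theorem 2.4 bound.

    mu     : number of block pairs, mu_n
    eps1   : eps/2 minus the expected L1 error of the d-dimensional histogram
    eps2   : eps minus the expected L1 error of the 2d-dimensional histogram
    beta_m : beta-mixing coefficient at lag m_n
    """
    if eps1 <= 0 or eps2 <= 0:
        raise ValueError("the bound is only informative for eps1, eps2 > 0")
    return (2.0 * math.exp(-mu * eps1 ** 2 / 2.0)
            + 2.0 * math.exp(-mu * eps2 ** 2 / 2.0)
            + 4.0 * (mu - 1) * beta_m)


# More blocks shrink the exponential terms but inflate the mixing term.
for mu in (50, 200, 800):
    print(mu, theorem_2_4_bound(mu, eps1=0.3, eps2=0.3, beta_m=1e-6))
```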

Consistency of the estimator β̂^d(a) is guaranteed only for certain choices of m_n and μ_n. Clearly μ_n → ∞ and μ_nβ(m_n) → 0 as n → ∞ are necessary conditions. Consistency also requires convergence of the histogram estimators to the target densities. We leave the proof of this theorem for §4. As an example to show that this bound can go to zero with proper choices of m_n and μ_n, the following corollary proves consistency for first order Markov processes. Consistency of the estimator for higher order Markov processes can be proven similarly. These processes are geometrically β-mixing as shown in e.g. Nummelin and Tuominen [19].

Corollary 2.5

Let $X_{1}^{n}$ be a sample from a first order Markov process with β(a) = β¹(a) = O(r^a) for some 0 ≤ r < 1. Then under the conditions of Theorem 2.4, ${\hat{β}}^{1} (a) \overset{P}{\to} β (a)$ at a rate of $o (\sqrt{n})$ up to a logarithmic factor.

Proof

Recall that n = 2μ_nm_n. Then,

4 (μ_{n} - 1) β (m_{n}) = 4 μ_{n} β (m_{n}) - 4 β (m_{n}) = K_{1} \frac{n}{m_{n}} r^{m_{n}} + K_{2} r^{m_{n}} \to 0

if m_n = Ω(log n) for constants K₁ and K₂. But the exponential terms are

exp {- K_{3} \frac{n ε_{j}^{2}}{m_{n}}}

for j = 1, 2 and a constant K₃. Therefore, both exponential terms go to 0 as n → ∞ for m_n = o(n). Balancing the rates gives the optimal choice of $m_{n} = o (\sqrt{n})$ with corresponding rate of convergence (up to a logarithmic factor) of $o (\sqrt{n})$ .
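The geometric decay assumed in Corollary 2.5 can be made concrete with a small exact calculation. The sketch below computes β¹(a) for a stationary two-state Markov chain directly from the definition, as half the L¹ distance between the joint law of (X₀, X_a) and the product of its marginals. The two-state chain, its transition probabilities p and q, and the function names are assumptions chosen for illustration.

```python
def mat_mul(A, B):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def beta1_two_state(p, q, a):
    """Exact beta^1(a) for the chain with P(0 -> 1) = p and P(1 -> 0) = q."""
    P = [[1.0 - p, p], [q, 1.0 - q]]
    pi = [q / (p + q), p / (p + q)]  # stationary distribution
    Pa = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(a):               # a-step transition matrix
        Pa = mat_mul(Pa, P)
    # Half the L1 distance between joint pi(x) P^a(x, y) and product pi(x) pi(y).
    return 0.5 * sum(abs(pi[x] * Pa[x][y] - pi[x] * pi[y])
                     for x in range(2) for y in range(2))


for a in range(1, 6):
    print(a, beta1_two_state(0.2, 0.3, a))
```

For this chain the printed values shrink by the factor |1 − p − q| at each lag, which plays the role of r in the corollary.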

Proving Theorem 2.4 requires showing the L¹ convergence of the histogram density estimator with β-mixing data. We do this in the next section.

3 L¹ convergence of histograms

Convergence of density estimators is thoroughly studied in the statistics and machine learning literature. Early papers on the L^∞ convergence of kernel density estimators (KDEs) include [26, 2, 22]; Freedman and Diaconis [10] look specifically at histogram estimators, and Yu [27] considers the L^∞ convergence of KDEs for β-mixing data and shows that the optimal IID rates can be attained. Devroye and Györfi [8] argue that L¹ is a more appropriate metric for studying density estimation, and Tran [23] proves L¹ consistency of KDEs under α- and β-mixing. As far as we are aware, ours is the first proof of L¹ convergence for histograms under β-mixing.

Additionally, the dimensionality of the target density is analogous to the order of the Markov approximation. Therefore, the convergence rates we give are asymptotic in the bandwidth h_n which shrinks as n increases, but also in the dimension d which increases with n. Even under these asymptotics, histogram estimation in this sense is not a high dimensional problem. The dimension of the target density considered here is on the order of exp{W(log n)}, a rate somewhere between log n and log log n.

Theorem 3.1

If f̂ is the histogram estimator based on a (possibly vector valued) sample $X_{1}^{n}$ from a β-mixing sequence with stationary density f, then for all ε > 𝔼 [∫|f̂ − f|],

ℙ (\int | \hat{f} - f | > ε) \leq 2 exp {- \frac{μ_{n} ε_{1}^{2}}{2}} + 2 (μ_{n} - 1) β (m_{n})

(5)

where ε₁ = ε − 𝔼 [∫|f̂ − f|].

To prove this result, we use the blocking method of Yu [28] to transform the dependent β-mixing sequence into a sequence of nearly independent blocks. We then apply McDiarmid’s inequality to the blocks to derive asymptotics in the bandwidth of the histogram as well as the dimension of the target density. For completeness, we state Yu’s blocking result and McDiarmid’s inequality before proving the doubly asymptotic histogram convergence for IID data. Combining these lemmas allows us to derive rates of convergence for histograms based on β-mixing inputs.

Lemma 3.2

(Lemma 4.1 in Yu [28]). Let ϕ be a measurable function with respect to the block sequence U uniformly bounded by M. Then,

| 𝔼 [ϕ] - \tilde{𝔼} [ϕ] | \leq M β (m_{n}) (μ_{n} - 1),

(6)

where the first expectation is with respect to the dependent block sequence, U, and 𝔼̃ is with respect to the independent sequence, U′.

This lemma essentially gives a method of applying IID results to β-mixing data. Because the dependence decays as we increase the separation between blocks, widely spaced blocks are nearly independent of each other. In particular, the difference between expectations over these nearly independent blocks and expectations over blocks which are actually independent can be controlled by the β-mixing coefficient.

Lemma 3.3

(McDiarmid Inequality [14]). Let X₁,…, X_n be independent random variables, with X_i taking values in a set A_i for each i. Suppose that the measurable function f : ∏ A_i → ℝ satisfies

| f (x) - f (x') | \leq c_{i}

whenever the vectors x and x′ differ only in the i^th coordinate. Then for any ε > 0,

ℙ (f - 𝔼 f > ε) \leq exp {- \frac{2 ε^{2}}{\sum c_{i}^{2}}} .

Lemma 3.4

For an IID sample X₁,…, X_n from some density f on ℝ^d

𝔼 \int | \hat{f} - 𝔼 \hat{f} | d x = O (1 / \sqrt{n h_{n}^{d}})

(7)

\int | 𝔼 \hat{f} - f | d x = O (d h_{n}) + O (d^{2} h_{n}^{2}),

(8)

where f̂ is the histogram estimate using a grid with sides of length h_n.

Proof of Lemma 3.4

Let p_j be the probability of falling into the j^th bin B_j. Then,

𝔼 \int | \hat{f} - 𝔼 \hat{f} | = h_{n}^{d} \sum_{j = 1}^{J} 𝔼 | \frac{1}{n h_{n}^{d}} \sum_{i = 1}^{n} I (X_{i} \in B_{j}) - \frac{p_{j}}{h_{n}^{d}} | \leq h_{n}^{d} \sum_{j = 1}^{J} \frac{1}{n h_{n}^{d}} \sqrt{𝕍 [\sum_{i = 1}^{n} I (X_{i} \in B_{j})]} = h_{n}^{d} \sum_{j = 1}^{J} \frac{1}{n h_{n}^{d}} \sqrt{n p_{j} (1 - p_{j})} = \frac{1}{\sqrt{n}} \sum_{j = 1}^{J} \sqrt{p_{j} (1 - p_{j})} = O (n^{- 1 / 2}) O (h_{n}^{- d / 2}) = O (1 / \sqrt{n h_{n}^{d}}) .

For the second claim, consider the bin B_j centered at c. Let I be the union of all bins B_j. Assume the following:

f ∈ L₂ and f is absolutely continuous on I, with a.e. partial derivatives $f_{i} = \frac{\partial}{\partial y_{i}} f (y)$
f_i ∈ L₂ and f_i is absolutely continuous on I, with a.e. partial derivatives $f_{i k} = \frac{\partial}{\partial y_{k}} f_{i} (y)$
f_ik ∈ L₂ for all i, k.

Using a Taylor expansion

f (x) = f (c) + \sum_{i = 1}^{d} (x_{i} - c_{i}) f_{i} (c) + O (d^{2} h_{n}^{2}),

where $f_{i} (y) = \frac{\partial}{\partial y_{i}} f (y)$ . Therefore, p_j is given by

p_{j} = \int_{B_{j}} f (x) d x = h_{n}^{d} f (c) + O (d^{2} h_{n}^{d + 2})

since the integral of the second term over the bin is zero. This means that for the j^th bin,

𝔼 \hat{f_{n}} (x) - f (x) = \frac{p_{j}}{h_{n}^{d}} - f (x) = - \sum_{i = 1}^{d} (x_{i} - c_{i}) f_{i} (c) + O (d^{2} h_{n}^{2}) .

Therefore,

\int_{B_{j}} | 𝔼 \hat{f_{n}} (x) - f (x) | = \int_{B_{j}} | - \sum_{i = 1}^{d} (x_{i} - c_{i}) f_{i} (c) + O (d^{2} h_{n}^{2}) | \leq \int_{B_{j}} | - \sum_{i = 1}^{d} (x_{i} - c_{i}) f_{i} (c) | + \int_{B_{j}} O (d^{2} h_{n}^{2}) = \int_{B_{j}} | \sum_{i = 1}^{d} (x_{i} - c_{i}) f_{i} (c) | + O (d^{2} h_{n}^{2 + d}) = O (d h_{n}^{d + 1}) + O (d^{2} h_{n}^{2 + d})

Since each bin is bounded, we can sum over all J bins. The number of bins is $J = h_{n}^{- d}$ by definition, so

\int | 𝔼 \hat{f_{n}} (x) - f (x) | d x = O (h_{n}^{- d}) (O (d h_{n}^{d + 1}) + O (d^{2} h_{n}^{2 + d})) = O (d h_{n}) + O (d^{2} h_{n}^{2}) .
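The first-order behaviour of the bias term in (8) can be checked numerically in the simplest case d = 1. The sketch below takes the density f(x) = 2x on [0, 1] (an assumed example, not one used in the text), computes the expected histogram height p_j/h_n on each bin, and integrates |𝔼f̂ − f| with a midpoint rule. For this linear density the bias works out to h_n/2, so halving the bin width halves the bias, in line with the O(d h_n) term.

```python
def l1_bias(f, J, K=200):
    """Approximate L1 bias of a J-bin histogram on [0, 1] for density f.

    Uses a midpoint rule with K sub-intervals per bin. K is taken even so
    that the bin centre falls on a sub-interval boundary.
    """
    h = 1.0 / J
    dx = h / K
    total = 0.0
    for j in range(J):
        xs = [j * h + (k + 0.5) * dx for k in range(K)]
        p_j = sum(f(x) for x in xs) * dx  # probability of bin B_j
        mean_height = p_j / h             # E f_hat(x) for x in B_j
        total += sum(abs(mean_height - f(x)) for x in xs) * dx
    return total


def linear_density(x):
    return 2.0 * x


for J in (5, 10, 20, 40):
    print(1.0 / J, l1_bias(linear_density, J))
```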

We can now prove the main result of this section.

Proof of Theorem 3.1

Let g be the L¹ loss of the histogram estimator, $g = \int | f - \hat{f_{n}} |$ . Here $\hat{f_{n}} (x) = \frac{1}{n h_{n}^{d}} \sum_{i = 1}^{n} I (X_{i} \in B_{j} (x))$ where B_j(x) is the bin containing x. Let $\hat{f_{U}}$ , $\hat{f_{V}}$ , and $\hat{f_{U'}}$ be histograms based on the block sequences U, V, and U′ respectively. Clearly $\hat{f_{n}} = \frac{1}{2} (\hat{f_{U}} + \hat{f_{V}})$ . Now,

ℙ (g > ε) = ℙ (\int | f - \hat{f_{n}} | > ε) = ℙ (\int | \frac{f - \hat{f_{U}}}{2} + \frac{f - \hat{f_{V}}}{2} | > ε) \leq ℙ (\frac{1}{2} \int | f - \hat{f_{U}} | + \frac{1}{2} \int | f - \hat{f_{V}} | > ε) = ℙ (g_{U} + g_{V} > 2 ε) \leq ℙ (g_{U} > ε) + ℙ (g_{V} > ε) = 2 ℙ (g_{U} - 𝔼 [g_{U}] > ε - 𝔼 [g_{U}]) = 2 ℙ (g_{U} - 𝔼 [g_{U'}] > ε - 𝔼 [g_{U'}]) = 2 ℙ (g_{U} - 𝔼 [g_{U'}] > ε_{1}),

where ε₁ = ε − 𝔼[g_U′]. Here,

𝔼 [g_{U'}] \leq \tilde{𝔼} \int | \hat{f_{U'}} - \tilde{𝔼} \hat{f_{U'}} | d x + \int | \tilde{𝔼} \hat{f_{U'}} - f | d x,

so by Lemma 3.4, as long as μ_n → ∞, h_n ↓ 0 and $μ_{n} h_{n}^{d} \to \infty$ , for all ε there exists n₀(ε) such that for all n > n₀(ε), ε > 𝔼[g] = 𝔼[g_U′]. Now applying Lemma 3.2 to the expectation of the indicator of the event {g_U − 𝔼[g_U′] > ε₁} gives

2 ℙ (g_{U} - 𝔼 [g_{U'}] > ε_{1}) \leq 2 ℙ (g_{U'} - 𝔼 [g_{U'}] > ε_{1}) + 2 (μ_{n} - 1) β (m_{n})

where the probability on the right is for the σ-field generated by the independent block sequence U′. Since these blocks are independent, showing that g_U′ satisfies the bounded differences requirement allows for the application of McDiarmid’s inequality (Lemma 3.3) to the blocks. For any two block sequences $u_{1}^{'}, \dots, u_{μ_{n}}^{'}$ and $ū_{1}^{'}, \dots, ū_{μ_{n}}^{'}$ with $u_{ℓ}^{'} = ū_{ℓ}^{'}$ for all ℓ ≠ j,

| g_{U'} (u_{1}^{'}, \dots, u_{μ_{n}}^{'}) - g_{U'} (ū_{1}^{'}, \dots, ū_{μ_{n}}^{'}) | = | \int | \hat{f} (y; u_{1}^{'}, \dots, u_{μ_{n}}^{'}) - f (y) | d y - \int | \hat{f} (y; ū_{1}^{'}, \dots, ū_{μ_{n}}^{'}) - f (y) | d y | \leq \int | \hat{f} (y; u_{1}^{'}, \dots, u_{μ_{n}}^{'}) - \hat{f} (y; ū_{1}^{'}, \dots, ū_{μ_{n}}^{'}) | d y = \frac{2}{μ_{n} h_{n}^{d}} h_{n}^{d} = \frac{2}{μ_{n}} .

Therefore,

ℙ (g > ε) \leq 2 ℙ (g_{U'} - 𝔼 [g_{U'}] > ε_{1}) + 2 (μ_{n} - 1) β (m_{n}) \leq 2 exp {- \frac{μ_{n} ε_{1}^{2}}{2}} + 2 (μ_{n} - 1) β (m_{n}) .
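The final inequality substitutes the bounded-difference constants c_i = 2/μ_n, one for each of the μ_n blocks, into McDiarmid's inequality, so that Σc_i² = 4/μ_n and the exponent becomes −μ_nε₁²/2. The short check below confirms the arithmetic numerically; the function name and the particular values of μ_n and ε₁ are arbitrary choices for illustration.

```python
import math


def mcdiarmid_bound(eps, c):
    """One-sided McDiarmid bound exp(-2 eps^2 / sum_i c_i^2)."""
    return math.exp(-2.0 * eps ** 2 / sum(ci ** 2 for ci in c))


mu, eps1 = 50, 0.1
via_lemma = mcdiarmid_bound(eps1, [2.0 / mu] * mu)  # c_i = 2 / mu_n per block
closed_form = math.exp(-mu * eps1 ** 2 / 2.0)       # exponential term in (5)
print(via_lemma, closed_form)
```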

4 Proofs

The proof of Theorem 2.4 relies on the triangle inequality and the relationship between total variation distance and the L¹ distance between densities.

Proof of Theorem 2.4

For any probability measures ν and λ defined on the same probability space with associated densities f_ν and f_λ with respect to some dominating measure π,

{‖ ν - λ ‖}_{T V} = \frac{1}{2} \int | f_{ν} - f_{λ} | d (π) .
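For distributions on a finite set this identity can be verified by brute force: half the L¹ distance between the probability vectors equals the largest discrepancy |ν(A) − λ(A)| over all events A. The sketch below does this for two assumed three-point distributions; the function names are hypothetical.

```python
from itertools import combinations


def tv_half_l1(p, q):
    """Total variation as half the L1 distance between probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def tv_sup_events(p, q):
    """Total variation as the supremum of |p(A) - q(A)| over all events A."""
    idx = range(len(p))
    best = 0.0
    for r in range(len(p) + 1):
        for A in combinations(idx, r):
            gap = abs(sum(p[i] for i in A) - sum(q[i] for i in A))
            best = max(best, gap)
    return best


p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
print(tv_half_l1(p, q), tv_sup_events(p, q))
```

The enumeration over all subsets is exponential in the size of the alphabet, which is why the half-L¹ form is the one used for computation.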

Let P be the d-dimensional stationary distribution of the d^th order Markov process, i.e. $P = ℙ_{t - d + 1}^{t} = ℙ_{t + a}^{t + a + d - 1}$ in the notation of equation 3. Let ℙ_a,d be the joint distribution of the bivariate random process created by the initial process and itself separated by a time steps. By the triangle inequality, we can upper bound β^d(a) for any d = d_n. Let P̂ and ℙ̂_a,d be the distributions associated with histogram estimators f̂^d and ${\hat{f}}_{a}^{2 d}$ respectively. Then,

β^{d} (a) = {‖ P \otimes P - ℙ_{a, d} ‖}_{T V} = {‖ P \otimes P - \hat{P} \otimes \hat{P} + \hat{P} \otimes \hat{P} - {\hat{ℙ}}_{a, d} + {\hat{ℙ}}_{a, d} - ℙ_{a, d} ‖}_{T V} \leq {‖ P \otimes P - \hat{P} \otimes \hat{P} ‖}_{T V} + {‖ \hat{P} \otimes \hat{P} - {\hat{ℙ}}_{a, d} ‖}_{T V} + {‖ {\hat{ℙ}}_{a, d} - ℙ_{a, d} ‖}_{T V} \leq 2 {‖ P - \hat{P} ‖}_{T V} + {‖ \hat{P} \otimes \hat{P} - {\hat{ℙ}}_{a, d} ‖}_{T V} + {‖ {\hat{ℙ}}_{a, d} - ℙ_{a, d} ‖}_{T V} = \int | f^{d} - {\hat{f}}^{d} | + \frac{1}{2} \int | {\hat{f}}^{d} \otimes {\hat{f}}^{d} - {\hat{f}}_{a}^{2 d} | + \frac{1}{2} \int | f_{a}^{2 d} - {\hat{f}}_{a}^{2 d} |

where $\frac{1}{2} \int | {\hat{f}}^{d} \otimes {\hat{f}}^{d} - {\hat{f}}_{a}^{2 d} |$ is our estimator β̂^d(a) and the remaining terms are the L¹ distance between a density estimator and the target density. Thus,

β^{d} (a) - {\hat{β}}^{d} (a) \leq \int | f^{d} - {\hat{f}}^{d} | + \frac{1}{2} \int | f_{a}^{2 d} - {\hat{f}}_{a}^{2 d} | .

A similar argument starting from β̂^d(a) = ‖P̂ ⊗ P̂ − ℙ̂_a,d‖_TV shows that

β^{d} (a) - {\hat{β}}^{d} (a) \geq - \int | f^{d} - {\hat{f}}^{d} | - \frac{1}{2} \int | f_{a}^{2 d} - {\hat{f}}_{a}^{2 d} |,

so we have that

| β^{d} (a) - {\hat{β}}^{d} (a) | \leq \int | f^{d} - {\hat{f}}^{d} | + \frac{1}{2} \int | f_{a}^{2 d} - {\hat{f}}_{a}^{2 d} | .

Therefore,

ℙ (| β^{d} (a) - {\hat{β}}^{d} (a) | > ε) \leq ℙ (\int | f^{d} - {\hat{f}}^{d} | + \frac{1}{2} \int | f_{a}^{2 d} - {\hat{f}}_{a}^{2 d} | > ε) \leq ℙ (\int | f^{d} - {\hat{f}}^{d} | > \frac{ε}{2}) + ℙ (\frac{1}{2} \int | f_{a}^{2 d} - {\hat{f}}_{a}^{2 d} | > \frac{ε}{2}) \leq 2 exp {- \frac{μ_{n} ε_{1}^{2}}{2}} + 2 exp {- \frac{μ_{n} ε_{2}^{2}}{2}} + 4 (μ_{n} - 1) β (m_{n}),

where ε₁ = ε/2 − 𝔼 [∫|f̂^d − f^d|] and $ε_{2} = ε - 𝔼 [\int | {\hat{f}}_{a}^{2 d} - f_{a}^{2 d} |]$ .

The proof of Theorem 2.3 requires two steps which are given in the following Lemmas. The first specifies the histogram bandwidth h_n and the rate at which d_n (the dimensionality of the target density) goes to infinity. If the dimensionality of the target density were fixed, we could achieve rates of convergence similar to those for histograms based on IID inputs. However, we wish to allow the dimensionality to grow with n, so the rates are much slower as shown in the following lemma.

Lemma 4.1

For the histogram estimator in Lemma 3.4, let

d_{n} ~ exp {W (log n)},

h_{n} ~ n^{- k_{n}},

with

k_{n} = \frac{W (log n) + \frac{1}{2} log n}{log n (\frac{1}{2} exp {W (log n)} + 1)} .

These choices lead to the optimal rate of convergence.

Proof

Let $h_{n} = n^{- k_{n}}$ for some k_n to be determined. Then we want $n^{- 1 / 2} h_{n}^{- d_{n} / 2} = n^{(k_{n} d_{n} - 1) / 2} \to 0$ , $d_{n} h_{n} = d_{n} n^{- k_{n}} \to 0$ , and $d_{n}^{2} h_{n}^{2} = d_{n}^{2} n^{- 2 k_{n}} \to 0$ all as n → ∞. Call these A, B, and C. Taking A and B first gives

n^{(k_{n} d_{n} - 1) / 2} ~ d_{n} n^{- k_{n}} \Rightarrow \frac{1}{2} (k_{n} d_{n} - 1) log n ~ log d_{n} - k_{n} log n \Rightarrow k_{n} log n (\frac{1}{2} d_{n} + 1) ~ log d_{n} + \frac{1}{2} log n \Rightarrow k_{n} ~ \frac{log d_{n} + \frac{1}{2} log n}{log n (\frac{1}{2} d_{n} + 1)} .

(9)

Similarly, combining A and C gives

k_{n} ~ \frac{2 log d_{n} + \frac{1}{2} log n}{log n (\frac{1}{2} d_{n} + 2)} .

(10)

Equating (9) and (10) and solving for d_n gives

d_{n} ~ exp {W (log n)}

where W(·) is the Lambert W function. Plugging back into (9) gives that

h_{n} = n^{- k_{n}}

where

k_{n} = \frac{W (log n) + \frac{1}{2} log n}{log n (\frac{1}{2} exp {W (log n)} + 1)} .
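To get a feel for the sizes involved, the sketch below evaluates d_n, k_n and h_n from the displayed formulas for a given n. It uses a small Newton iteration for the principal branch of the Lambert W function so that only the standard library is needed; the function names and the solver are assumptions of the sketch. Equating (9) and (10) reduces to d_n log d_n = log n, which the code checks, and under exact equality the formula for k_n simplifies algebraically to 1/d_n.

```python
import math


def lambert_w(x, tol=1e-14, max_iter=100):
    """Principal branch W(x) for x >= 0, by Newton's method on w*exp(w) = x."""
    if x < 0:
        raise ValueError("this sketch only handles x >= 0")
    w = math.log1p(x)  # starting point at or to the right of the root
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w


def lemma_4_1_choices(n):
    """Return (d_n, k_n, h_n) from the formulas in Lemma 4.1, taking ~ as =."""
    if n <= 1:
        raise ValueError("need n > 1 so that log n > 0")
    log_n = math.log(n)
    w = lambert_w(log_n)
    d = math.exp(w)
    k = (w + 0.5 * log_n) / (log_n * (0.5 * d + 1.0))
    h = n ** (-k)
    return d, k, h


for n in (10 ** 3, 10 ** 6, 10 ** 9):
    d, k, h = lemma_4_1_choices(n)
    print(n, round(d, 3), round(k, 4), round(h, 4))
```

With ~ read as exact equality, the three balanced terms A, B and C all evaluate to 1, so the sketch checks the algebra behind the lemma rather than demonstrating convergence. In practice d_n would also be rounded to an integer.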

It is also necessary to show that as d grows, β^d(a) → β(a). We now prove this result.

Lemma 4.2

β^d(a) converges to β(a) as d → ∞.

Proof

By stationarity, the supremum over t is unnecessary in Definition 2.1, so without loss of generality, let t = 0. Let $ℙ_{- \infty}^{0}$ be the distribution on $σ_{- \infty}^{0} = σ (\dots, X_{- 1}, X_{0})$ , and let $ℙ_{a + 1}^{\infty}$ be the distribution on $σ_{a + 1}^{\infty} = σ (X_{a + 1}, X_{a + 2}, \dots)$ . Let ℙ_a be the distribution on $σ = σ_{- \infty}^{0} \otimes σ_{a + 1}^{\infty}$ (the product sigma-field). Then we can rewrite Definition 2.1 using this notation as

β (a) = sup_{C \in σ} | ℙ_{a} (C) - [ℙ_{- \infty}^{0} \otimes ℙ_{a + 1}^{\infty}] (C) | .

Let $σ_{- d + 1}^{0}$ and $σ_{a + 1}^{a + d}$ be the sub-σ-fields of $σ_{- \infty}^{0}$ and $σ_{a + 1}^{\infty}$ consisting of the d-dimensional cylinder sets for the d dimensions closest together. Let σ^d be the product σ-field of these two. Then we can rewrite β^d(a) as

β^{d} (a) = sup_{C \in σ^{d}} | ℙ_{a} (C) - [ℙ_{- \infty}^{0} \otimes ℙ_{a + 1}^{\infty}] (C) | .

(11)

As such β^d(a) ≤ β(a) for all a and d. We can rewrite (11) in terms of finite-dimensional marginals:

β^{d} (a) = sup_{C \in σ^{d}} | ℙ_{a, d} (C) - [ℙ_{- d + 1}^{0} \otimes ℙ_{a + 1}^{a + d}] (C) |,

where ℙ_a,d is the restriction of ℙ to $σ (X_{- d + 1}, \dots, X_{0}, X_{a + 1}, \dots, X_{a + d})$ . Because of the nested nature of these sigma-fields, we have

β^{d_{1}} (a) \leq β^{d_{2}} (a) \leq β (a)

for all finite d₁ ≤ d₂. Therefore, for fixed a, $\{β^{d} (a)\}_{d = 1}^{\infty}$ is a monotone increasing sequence which is bounded above, and it converges to some limit L ≤ β(a). To show that L = β(a) requires some additional steps.

Let $R = ℙ_{a} - [ℙ_{- \infty}^{0} \otimes ℙ_{a + 1}^{\infty}]$ , which is a signed measure on σ. Let $R^{d} = ℙ_{a, d} - [ℙ_{- d + 1}^{0} \otimes ℙ_{a + 1}^{a + d}]$ , which is a signed measure on σ^d. Decompose R into positive and negative parts as R = Q⁺ − Q⁻ and similarly for $R^{d} = Q^{+ d} - Q^{- d}$ . Notice that since R^d is constructed using the marginals of ℙ, then R(E) = R^d(E) for all E ∈ σ^d. Now since R is the difference of probability measures, we must have that

0 = R (Ω) = Q^{+} (Ω) - Q^{-} (Ω) = Q^{+} (D) + Q^{+} (D^{c}) - Q^{-} (D) - Q^{-} (D^{c})

(12)

for all D ∈ σ.

Define Q = Q⁺ + Q⁻. Let ε > 0. Let C ∈ σ be such that

Q (C) = β (a) = Q^{+} (C) = Q^{-} (C^{c}) .

(13)

Such a set C is guaranteed by the Hahn decomposition theorem (letting C* be a set which attains the supremum in (11), we can throw away any subsets with negative R measure) and (12) assuming without loss of generality that $ℙ_{a} (C) > [ℙ_{- \infty}^{0} \otimes ℙ_{a + 1}^{\infty}] (C)$ . We can use the field σ_f = ∪_d σ^d to approximate σ in the sense that, for all ε, we can find A ∈ σ_f such that Q(AΔC) < ε/2 (see Theorem D in Halmos [11, §13] or Lemma A.24 in Schervish [21]). Now,

Q (A Δ C) = Q (A \cap C^{c}) + Q (C \cap A^{c}) = Q^{-} (A \cap C^{c}) + Q^{+} (C \cap A^{c})

by (13) since A∩C^c ⊆ C^c and C∩A^c ⊆ C. Therefore, since Q(AΔC) < ε/2, we have

Q^{-} (A \cap C^{c}) \leq ε / 2

(14)

Q^{+} (A^{c} \cap C) \leq ε / 2 .

Also,

Q (C) = Q (A \cap C) + Q (A^{c} \cap C) = Q^{+} (A \cap C) + Q^{+} (A^{c} \cap C) \leq Q^{+} (A) + ε / 2

since A∩C and A^c∩C are contained in C and A∩C ⊆ A. Therefore

Q^{+} (A) \geq Q (C) - ε / 2 .

Similarly,

Q^{-} (A) = Q^{-} (A \cap C) + Q^{-} (A \cap C^{c}) \leq 0 + ε / 2 = ε / 2

since A ∩ C ⊆ C and Q⁻(C) = 0 by (13). Finally,

Q^{+ d} (A) \geq Q^{+ d} (A) - Q^{- d} (A) = R^{d} (A) = R (A) = Q^{+} (A) - Q^{-} (A) \geq Q (C) - ε / 2 - ε / 2 = Q (C) - ε = β (a) - ε .

And since $β^{d} (a) \geq Q^{+ d} (A)$ , we have that for all ε > 0 there exists d such that for all d₁ > d,

β^{d_{1}} (a) \geq β^{d} (a) \geq Q^{+ d} (A) \geq β (a) - ε .

Thus, we must have that L = β(a), so that β^d(a) → β(a) as desired.

Proof of Theorem 2.3

By the triangle inequality,

| {\hat{β}}^{d_{n}} (a) - β (a) | \leq | {\hat{β}}^{d_{n}} (a) - β^{d_{n}} (a) | + | β^{d_{n}} (a) - β (a) | .

The first term on the right is bounded by the result in Theorem 2.4, and Lemma 4.1 shows that d_n = O(exp{W(log n)}) grows slowly enough for the histogram estimator to remain consistent. That $β^{d_{n}} (a) \overset{d_{n} \to \infty}{\to} β (a)$ follows from Lemma 4.2.

5 Discussion

We have shown that our estimator of the β-mixing coefficients is consistent for the true coefficients β(a) under some conditions on the data generating process. There are numerous results in the statistics and machine learning literatures which assume knowledge of the β-mixing coefficients, yet as far as we know, this is the first estimator for them. An ability to estimate these coefficients will allow researchers to apply existing results to dependent data without the need to arbitrarily assume their values. Despite the obvious utility of this estimator, as a consequence of its novelty, it comes with a number of potential extensions which warrant careful exploration as well as some drawbacks.

The reader will note that Theorem 2.3 does not provide a convergence rate. The rate in Theorem 2.4 applies only to the difference between β̂^d(a) and β^d(a). In order to provide a rate in Theorem 2.3, we would need a better understanding of the non-stochastic convergence of β^d(a) to β(a). It is not immediately clear that this quantity can converge at any well-defined rate. In particular, it seems likely that the rate of convergence depends on the tail of the sequence $\{β (a)\}_{a = 1}^{\infty}$ .

Several other mixing and weak-dependence coefficients also have a total-variation flavor, perhaps most notably α-mixing [9, 7, 4]. None of them have estimators, and the same trick might well work for them, too.

The use of histograms rather than kernel density estimators for the joint and marginal densities is surprising and perhaps not ultimately necessary. As mentioned above, Tran [23] proved that KDEs are consistent for estimating the stationary density of a time series with β-mixing inputs, so perhaps one could replace the histograms in our estimator with KDEs. However, this would need an analogue of the double asymptotic results proven for histograms in Lemma 3.4. In particular, we need to estimate increasingly higher dimensional densities as n → ∞. This does not cause a problem of small-n-large-d since d is chosen as a function of n, however it will lead to increasingly higher dimensional integration. For histograms, the integral is always trivial, but in the case of KDEs, the numerical accuracy of the integration algorithm becomes increasingly important. This issue could swamp any efficiency gains obtained through the use of kernels. However, this question certainly warrants further investigation.

The main drawback of an estimator based on a density estimate is its complexity. The mixing coefficients are functionals of the joint and marginal distributions derived from the stochastic process X, but it is unsatisfying to estimate densities and evaluate integrals in order to estimate a single number. Vapnik’s main principle for solving problems using a restricted amount of information is:

When solving a given problem, try to avoid solving a more general problem as an intermediate step [24, p. 30].

This principle is clearly violated here, but perhaps our seed will precipitate a more aesthetically pleasing solution.

Acknowledgements

The authors wish to thank Darren Homrighausen and two anonymous reviewers for helpful comments and the Institute for New Economic Thinking for supporting this research.

Footnotes

Appearing in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Fort Lauderdale, FL, USA.

The literature on algorithmic stability refers to this as β-stability (e.g. Bousquet and Elisseeff [3]).

While it is clearly possible to replace histograms with other choices of density estimators (most notably kernel density estimators), histograms in this case are more convenient theoretically and computationally. See §5 for more details.

The Lambert W function is defined as the (multivalued) inverse of f(w) = w exp{w}. Thus, O(exp{W(log n)}) is bigger than O(log log n) but smaller than O(log n). See for example Corless et al. [6].
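
A short derivation from the defining identity may make these bounds concrete. Writing $w = W(\log n)$ for the principal branch, so that $w e^{w} = \log n$, we have

$$
\exp\{W(\log n)\} = \frac{\log n}{W(\log n)} .
$$

Since $W(x) \to \infty$ as $x \to \infty$, this is $o(\log n)$. Since $1 \le W(x) \le \log x$ for $x \ge e$, it is also at least $\log n / \log\log n$ once $\log n \ge e$, which eventually exceeds $\log\log n$.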

Contributor Information

Daniel J. McDonald, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213, danielmc@stat.cmu.edu

Cosma Rohilla Shalizi, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213, cshalizi@stat.cmu.edu

Mark Schervish, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213, mark@cmu.edu

References

1. Baraud Y, Comte F, Viennet G. Adaptive estimation in autoregression or β-mixing regression via model selection. The Annals of Statistics. 2001;29:839–875.
2. Bickel P, Rosenblatt M. On Some Global Measures of the Deviations of Density Function Estimates. The Annals of Statistics. 1973;1:1071–1095.
3. Bousquet O, Elisseeff A. Stability and Generalization. The Journal of Machine Learning Research. 2002;2:499–526.
4. Bradley RC. Basic Properties of Strong Mixing Conditions. A Survey and Some Open Questions. Probability Surveys. 2005;2:107–144.
5. Carrasco M, Chen X. Mixing and Moment Properties of Various GARCH and Stochastic Volatility Models. Econometric Theory. 2002;18:17–39.
6. Corless R, Gonnet G, Hare D, Jeffrey D, Knuth D. On the Lambert W Function. Advances in Computational Mathematics. 1996;5:329–359.
7. Dedecker J, Doukhan P, Lang G, Leon R JR, Louhichi S, Prieur C. Weak Dependence: With Examples and Applications, vol. 190 of Lecture Notes in Statistics. New York: Springer Verlag; 2007.
8. Devroye L, Györfi L. Nonparametric Density Estimation: The L1 View. New York: Wiley; 1985.
9. Doukhan P. Mixing: Properties and Examples, vol. 85 of Lecture Notes in Statistics. New York: Springer Verlag; 1994.
10. Freedman D, Diaconis P. On the Maximum Deviation Between the Histogram and the Underlying Density. Probability Theory and Related Fields. 1981;58:139–167.
11. Halmos P. Measure Theory, Graduate Texts in Mathematics. New York: Springer-Verlag; 1974.
12. Karandikar RL, Vidyasagar M. Probably Approximately Correct Learning with Beta-Mixing Input Sequences. 2009. Submitted for publication.
13. Lozano A, Kulkarni S, Schapire R. Convergence and Consistency of Regularized Boosting Algorithms with Stationary Beta-Mixing Observations. Advances in Neural Information Processing Systems. 2006;18:819.
14. McDiarmid C. On the Method of Bounded Differences. In: Siemons J, editor. Surveys in Combinatorics, vol. 141 of London Mathematical Society Lecture Note Series. Cambridge University Press; 1989. pp. 148–188.
15. Meir R. Nonparametric Time Series Prediction Through Adaptive Model Selection. Machine Learning. 2000;39:5–34.
16. Mohri M, Rostamizadeh A. Stability Bounds for Stationary φ-mixing and β-mixing Processes. Journal of Machine Learning Research. 2010;11:789–814.
17. Mokkadem A. Mixing properties of ARMA processes. Stochastic Processes and Their Applications. 1988;29:309–315.
18. Nobel A. Hypothesis Testing for Families of Ergodic Processes. Bernoulli. 2006;12:251–269.
19. Nummelin E, Tuominen P. Geometric Ergodicity of Harris Recurrent Markov Chains with Applications to Renewal Theory. Stochastic Processes and Their Applications. 1982;12:187–202.
20. Ralaivola L, Szafranski M, Stempfel G. Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes. Journal of Machine Learning Research. 2010;11:1927–1956.
21. Schervish M. Theory of Statistics, Springer Series in Statistics. New York: Springer Verlag; 1995.
22. Silverman B. Weak and Strong Uniform Consistency of the Kernel Estimate of a Density and its Derivatives. The Annals of Statistics. 1978;6:177–184.
23. Tran L. The L1 Convergence of Kernel Density Estimates under Dependence. The Canadian Journal of Statistics/La Revue Canadienne de Statistique. 1989;17:197–208.
24. Vapnik V. The Nature of Statistical Learning Theory, Statistics for Engineering and Information Science. 2nd edn. New York: Springer Verlag; 2000.
25. Vidyasagar M. A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems. Berlin: Springer Verlag; 1997.
26. Woodroofe M. On the Maximum Deviation of the Sample Density. The Annals of Mathematical Statistics. 1967;38:475–481.
27. Yu B. Density Estimation in the L∞ Norm for Dependent Data with Applications to the Gibbs Sampler. The Annals of Statistics. 1993;21:711–735.
28. Yu B. Rates of Convergence for Empirical Processes of Stationary Mixing Sequences. The Annals of Probability. 1994;22:94–116.
