Author manuscript; available in PMC: 2015 Aug 14.
Published in final edited form as: JMLR Workshop Conf Proc. 2011;15:516–524.

Estimating beta-mixing coefficients

Daniel J McDonald 1, Cosma Rohilla Shalizi 2, Mark Schervish 3
PMCID: PMC4537076  NIHMSID: NIHMS697610  PMID: 26279742

Abstract

The literature on statistical learning for time series assumes the asymptotic independence or “mixing” of the data-generating process. These mixing assumptions are never tested, and there are no methods for estimating mixing rates from data. We give an estimator for the beta-mixing rate based on a single stationary sample path and show it is L1-risk consistent.

1 Introduction

Relaxing the assumption of independence is an active area of research in the statistics and machine learning literature. For time series, independence is replaced by the asymptotic independence of events far apart in time, or “mixing”. Mixing conditions make the dependence of the future on the past explicit by quantifying the decay in dependence as the future moves farther from the past. There are many definitions of mixing of varying strength with matching dependence coefficients (see [9, 7, 4] for reviews), but most of the results in the learning literature focus on β-mixing or absolute regularity. Roughly speaking (see Definition 2.1 below for a precise statement), the β-mixing coefficient at lag a is the total variation distance between the actual joint distribution of events separated by a time steps and the product of their marginal distributions, i.e., the L1 distance from independence.

Numerous results in the statistical machine learning literature rely on knowledge of the β-mixing coefficient. As Vidyasagar [25, p. 41] notes, β-mixing is “just right” for the extension of IID results to dependent data, and so recent work has consistently focused on it. Meir [15] derives generalization error bounds for nonparametric methods based on model selection via structural risk minimization. Baraud et al. [1] study the finite sample risk performance of penalized least squares regression estimators under β-mixing. Lozano et al. [13] examine regularized boosting algorithms under absolute regularity and prove consistency. Karandikar and Vidyasagar [12] consider “probably approximately correct” learning algorithms, proving that PAC algorithms for IID inputs remain PAC with β-mixing inputs under some mild conditions. Ralaivola et al. [20] derive PAC bounds for ranking statistics and classifiers using a decomposition of the dependency graph. Finally, Mohri and Rostamizadeh [16] derive stability bounds for β-mixing inputs, generalizing existing stability results for IID data.

All these results assume not just β-mixing, but known mixing coefficients. In particular, the risk bounds in [15, 16] and [20] are incalculable without knowledge of the rates. This knowledge is never available. Unless researchers are willing to assume specific values for a sequence of β-mixing coefficients, the results mentioned in the previous paragraph are generally useless when confronted with data. To illustrate this deficiency, consider Theorem 18 of [16]:

Theorem 1.1

(Briefly). Assume a learning algorithm is λ-stable.¹ Then, for any sample of size n drawn from a stationary β-mixing distribution, and ε > 0

$$\mathbb{P}\left(|R - \hat{R}| > \varepsilon\right) \le \Gamma(n, \lambda, \varepsilon, a, b) + \beta(a)(\mu_n - 1)$$

where $n = (a + b)\mu_n$, Γ has a particular functional form, and R − R̂ is the difference between the true risk and the empirical risk.

Ideally, one could use this result for model selection or to control the size of the generalization error of competing prediction algorithms (support vector machines, support vector regression, and kernel ridge regression are a few of the many algorithms known to satisfy λ-stability). However the bound depends explicitly on the mixing coefficient β(a). To make matters worse, there are no methods for estimating the β-mixing coefficients. According to Meir [15, p. 7], “there is no efficient practical approach known at this stage for estimation of mixing parameters.” We begin to rectify this problem by deriving the first method for estimating these coefficients. We prove that our estimator is consistent for arbitrary β-mixing processes. In addition, we derive rates of convergence for Markov approximations to these processes.

Application of statistical learning results to β-mixing data is highly desirable in applied work. Many common time series models are known to be β-mixing, and the rates of decay are known given the true parameters of the process. Among the processes for which such knowledge is available are ARMA models [17], GARCH models [5], and certain Markov processes — see [9] for an overview of such results. To our knowledge, only Nobel [18] approaches a solution to the problem of estimating mixing rates by giving a method to distinguish between different polynomial mixing rate regimes through hypothesis testing.

We present the first method for estimating the β-mixing coefficients for stationary time series data. Section 2 defines the β-mixing coefficient and states our main results on convergence rates and consistency for our estimator. Section 3 gives an intermediate result on the L1 convergence of the histogram estimator with β-mixing inputs. Section 4 proves the main results from §2. Section 5 concludes and lays out some avenues for future research.

2 Estimation of β-mixing

In this section, we present one of many equivalent definitions of absolute regularity and state our main results, deferring proof to §4.

To fix notation, let $X = \{X_t\}_{t=-\infty}^{\infty}$ be a sequence of random variables where each $X_t$ is a measurable function from a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ into a measurable space 𝒳. A block of this random sequence will be given by $X_i^j \equiv \{X_t\}_{t=i}^{j}$, where i and j are integers and may be infinite. We use similar notation for the sigma fields generated by these blocks and their joint distributions. In particular, $\sigma_i^j$ will denote the sigma field generated by $X_i^j$, and the joint distribution of $X_i^j$ will be denoted $\mathbb{P}_i^j$.

2.1 Definitions

There are many equivalent definitions of β-mixing (see for instance [9], or [4] as well as Meir [15] or Yu [28]), however the most intuitive is that given in Doukhan [9].

Definition 2.1

(β-mixing). For each positive integer a, the coefficient of absolute regularity, or β-mixing coefficient, β(a), is

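$$\beta(a) \equiv \sup_t \left\| \mathbb{P}_{-\infty}^{t} \otimes \mathbb{P}_{t+a}^{\infty} - \mathbb{P}_{t,a} \right\|_{TV} \tag{1}$$

where $\|\cdot\|_{TV}$ is the total variation norm, and $\mathbb{P}_{t,a}$ is the joint distribution of $(X_{-\infty}^{t}, X_{t+a}^{\infty})$. A stochastic process is said to be absolutely regular, or β-mixing, if β(a) → 0 as a → ∞.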

Loosely speaking, Definition 2.1 says that the coefficient β(a) measures the total variation distance between the joint distribution of random variables separated by a time units and a distribution under which random variables separated by a time units are independent. The supremum over t is unnecessary for stationary random processes X which is the only case we consider here.
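As a concrete illustration of Definition 2.1, the sketch below evaluates the total variation distance between the joint law of (X_0, X_a) and the product of its marginals for a stationary two-state Markov chain. It relies on the fact, used later in the paper, that for a first order Markov process the one-dimensional coefficient coincides with β(a). The function name `beta_two_state` and the parameters `p` and `q` (the two switching probabilities) are assumptions introduced only for this example.

```python
import numpy as np

def beta_two_state(p, q, a):
    """beta(a) for a stationary two-state Markov chain (illustrative sketch).

    p = P(0 -> 1) and q = P(1 -> 0) are hypothetical parameter names.
    """
    P = np.array([[1.0 - p, p], [q, 1.0 - q]])   # one-step transition matrix
    pi = np.array([q, p]) / (p + q)              # stationary distribution
    Pa = np.linalg.matrix_power(P, a)            # a-step transition matrix
    joint = pi[:, None] * Pa                     # P(X_0 = i, X_a = j)
    product = np.outer(pi, pi)                   # law under independence
    # Total variation distance = half the L1 distance between the two laws.
    return 0.5 * np.abs(joint - product).sum()

if __name__ == "__main__":
    for lag in (1, 2, 5, 10):
        print(lag, beta_two_state(0.2, 0.3, lag))
```

For this chain the coefficient decays geometrically in the lag, in line with the geometric β-mixing of such Markov processes.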

Definition 2.2

(Stationarity). A sequence of random variables X is stationary when all its finite-dimensional distributions are invariant over time: for all t and all non-negative integers i and j, the random vectors $X_t^{t+i}$ and $X_{t+j}^{t+i+j}$ have the same distribution.

Our main result requires the method of blocking used by Yu [27, 28]. The purpose is to transform a sequence of dependent variables into a subsequence of nearly IID ones. Consider a sample $X_1^n$ from a stationary β-mixing sequence with density f. Let $m_n$ and $\mu_n$ be non-negative integers such that $2 m_n \mu_n = n$. Now divide $X_1^n$ into $2\mu_n$ blocks, each of length $m_n$. Identify the blocks as follows:

$$U_j = \{X_i : 2(j-1)m_n + 1 \le i \le (2j-1)m_n\},$$
$$V_j = \{X_i : (2j-1)m_n + 1 \le i \le 2j m_n\}.$$

Let U be the entire sequence of odd blocks $U_j$, and let V be the sequence of even blocks $V_j$. Finally, let U′ be a sequence of blocks which are independent of $X_1^n$ but such that each block has the same distribution as a block from the original sequence:

$$U'_j \overset{D}{=} U_j \overset{D}{=} U_1. \tag{2}$$

The blocks U′ are now an IID block sequence, so standard results apply. (See [28] for a more rigorous analysis of blocking.) With this structure, we can state our main result.
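The blocking construction can be sketched in a few lines. The helper below, whose name `make_blocks` is hypothetical, splits a sample into the odd blocks U and even blocks V of length m. Dropping any leftover observations when n is not a multiple of 2m is an assumption of this sketch, since the text takes 2·m_n·μ_n = n exactly.

```python
import numpy as np

def make_blocks(x, m):
    """Split a sample into odd blocks U and even blocks V of length m.

    Returns two arrays of shape (mu, m, ...) where mu = len(x) // (2 * m).
    """
    x = np.asarray(x)
    mu = len(x) // (2 * m)
    trimmed = x[: 2 * m * mu]                 # drop the remainder (assumption)
    blocks = trimmed.reshape(2 * mu, m, *x.shape[1:])
    return blocks[0::2], blocks[1::2]         # U_1, U_2, ... and V_1, V_2, ...

if __name__ == "__main__":
    U, V = make_blocks(np.arange(1, 13), m=2)
    print(U.tolist())
    print(V.tolist())
```

Consecutive U blocks are separated by a V block of length m, which is what makes them nearly independent when β(m) is small.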

2.2 Results

Our main result emerges in two stages. First, we recognize that the distribution of a finite sample depends only on finite-dimensional distributions. This leads to an estimator of a finite-dimensional version of β(a). Next, we let the finite-dimension increase to infinity with the size of the observed sample.

For positive integers t, d, and a, define

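$$\beta^d(a) \equiv \left\| \mathbb{P}_{t-d+1}^{t} \otimes \mathbb{P}_{t+a}^{t+a+d-1} - \mathbb{P}_{t,a,d} \right\|_{TV}, \tag{3}$$

where $\mathbb{P}_{t,a,d}$ is the joint distribution of $(X_{t-d+1}^{t}, X_{t+a}^{t+a+d-1})$. Also, let $\hat{f}^d$ be the d-dimensional histogram estimator of the joint density of d consecutive observations, and let $\hat{f}_a^{2d}$ be the 2d-dimensional histogram estimator of the joint density of two sets of d consecutive observations separated by a time points.

We construct an estimator of $\beta^d(a)$ based on these two histograms.² Define

$$\hat{\beta}^d(a) \equiv \frac{1}{2} \int \left| \hat{f}_a^{2d} - \hat{f}^d \otimes \hat{f}^d \right| \tag{4}$$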

We show that, by allowing d = dn to grow with n, this estimator will converge on β(a). This can be seen most clearly by bounding the ℓ1-risk of the estimator with its estimation and approximation errors:

$$\left|\hat{\beta}^d(a) - \beta(a)\right| \le \left|\hat{\beta}^d(a) - \beta^d(a)\right| + \left|\beta^d(a) - \beta(a)\right|.$$

The first term is the error of estimating βd(a) with a random sample of data. The second term is the non-stochastic error induced by approximating the infinite dimensional coefficient, β(a), with its d-dimensional counterpart, βd(a).
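A minimal sketch of the histogram-based estimator in equation (4) for a scalar time series follows. Several choices here are assumptions made for illustration and are not prescribed by the text: the function name `beta_hat`, a common equal-width grid with `bins` cells per axis spanning the observed range, and the use of every overlapping d-block of the sample to build both histograms. Because all cells have equal volume, the integral in (4) reduces to a sum of absolute differences between cell probabilities.

```python
import numpy as np

def beta_hat(x, a, d=1, bins=4):
    """Histogram plug-in estimate of the d-dimensional beta-mixing coefficient."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Row s holds the d-block (x[s], ..., x[s + d - 1]).
    blocks = np.column_stack([x[i : n - d + 1 + i] for i in range(d)])
    # Pair the block ending at time t with the block starting at time t + a.
    n_pairs = n - 2 * d - a + 2
    past = blocks[:n_pairs]
    future = blocks[d - 1 + a :]
    edges = np.linspace(x.min(), x.max(), bins + 1)
    # 2d-dimensional histogram of (past, future) pairs, as cell probabilities.
    joint, _ = np.histogramdd(np.hstack([past, future]), bins=[edges] * (2 * d))
    joint /= joint.sum()
    # d-dimensional histogram of single blocks, as cell probabilities.
    marg, _ = np.histogramdd(blocks, bins=[edges] * d)
    marg /= marg.sum()
    product = np.multiply.outer(marg, marg)   # cell probabilities under independence
    return 0.5 * np.abs(joint - product).sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    iid = rng.normal(size=5000)
    noise = rng.normal(size=5000)
    ar = np.zeros(5000)
    for t in range(1, 5000):
        ar[t] = 0.9 * ar[t - 1] + noise[t]    # strongly dependent AR(1) series
    print("IID  :", beta_hat(iid, a=1))
    print("AR(1):", beta_hat(ar, a=1))
```

On independent data the estimate should be close to zero, up to the positive bias of an absolute-value statistic, while a strongly autocorrelated series should give a visibly larger value at short lags.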

Our first theorem in this section establishes consistency of β̂dn (a) as an estimator of β(a) for all β-mixing processes provided dn increases at an appropriate rate. Theorem 2.4 gives finite sample bounds on the estimation error while some measure theoretic arguments contained in §4 show that the approximation error must go to zero as dn → ∞.

Theorem 2.3

Let $X_1^n$ be a sample from an arbitrary β-mixing process. Let $d_n = O(\exp\{W(\log n)\})$ where W is the Lambert W function.³ Then $\hat{\beta}^{d_n}(a) \xrightarrow{P} \beta(a)$ as n → ∞.
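To get a feel for how slowly the dimension in Theorem 2.3 grows, the sketch below evaluates exp{W(log n)} for a few sample sizes. The Newton solver `lambert_w` and the helper `d_n` are illustrative names, and the solver handles only the principal branch for non-negative arguments, which is all that is needed here.

```python
import math

def lambert_w(x, tol=1e-12, max_iter=100):
    """Principal branch of the Lambert W function for x >= 0 (Newton's method)."""
    w = math.log1p(x)                 # start to the right of the root
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w

def d_n(n):
    """Growth rate exp{W(log n)} of the dimension, ignoring constants."""
    return math.exp(lambert_w(math.log(n)))

if __name__ == "__main__":
    for n in (10**3, 10**6, 10**9):
        print(n, math.log(math.log(n)), d_n(n), math.log(n))
```

The printed values sit between log log n and log n, as the footnote on the Lambert W function states.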

A finite sample bound for the estimation error is the first step to establishing consistency for β̂d(a). This result gives convergence rates for estimation of the finite dimensional mixing coefficient βd(a) and also for Markov processes of known order d, since in this case, βd(a) = β(a).

Theorem 2.4

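Consider a sample $X_1^n$ from a stationary β-mixing process. Let $\mu_n$ and $m_n$ be positive integers such that $2\mu_n m_n = n$ and $\mu_n \ge d > 0$. Then

$$\mathbb{P}\left(\left|\hat{\beta}^d(a) - \beta^d(a)\right| > \varepsilon\right) \le 2\exp\left\{-\frac{\mu_n \varepsilon_1^2}{2}\right\} + 2\exp\left\{-\frac{\mu_n \varepsilon_2^2}{2}\right\} + 4(\mu_n - 1)\beta(m_n),$$

where $\varepsilon_1 = \varepsilon/2 - \mathbb{E}\left[\int|\hat{f}^d - f^d|\right]$ and $\varepsilon_2 = \varepsilon - \mathbb{E}\left[\int|\hat{f}_a^{2d} - f_a^{2d}|\right]$.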

Consistency of the estimator β̂d(a) is guaranteed only for certain choices of mn and μn. Clearly μn → ∞ and μnβ(mn) → 0 as n → ∞ are necessary conditions. Consistency also requires convergence of the histogram estimators to the target densities. We leave the proof of this theorem for section 4. As an example to show that this bound can go to zero with proper choices of mn and μn, the following corollary proves consistency for first order Markov processes. Consistency of the estimator for higher order Markov processes can be proven similarly. These processes are geometrically β-mixing as shown in e.g. Nummelin and Tuominen [19].

Corollary 2.5

Let $X_1^n$ be a sample from a first order Markov process with $\beta(a) = \beta^1(a) = O(r^a)$ for some 0 ≤ r < 1. Then under the conditions of Theorem 2.4, $\hat{\beta}^1(a) \xrightarrow{P} \beta(a)$ at a rate of $o(\sqrt{n})$ up to a logarithmic factor.

Proof

Recall that $n = 2\mu_n m_n$. Then,

$$4(\mu_n - 1)\beta(m_n) = 4\mu_n\beta(m_n) - 4\beta(m_n) = K_1 \frac{n}{m_n} r^{m_n} - K_2 r^{m_n} \to 0$$

if $m_n = \Omega(\log n)$ for constants $K_1$ and $K_2$. But the exponential terms are

$$\exp\left\{-K_3 \frac{n \varepsilon_j^2}{m_n}\right\}$$

for j = 1, 2 and a constant $K_3$. Therefore, both exponential terms go to 0 as n → ∞ for $m_n = o(n)$. Balancing the rates gives the optimal choice of $m_n = o(\sqrt{n})$ with corresponding rate of convergence (up to a logarithmic factor) of $o(\sqrt{n})$.
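The trade-off in the proof can be explored numerically. The sketch below evaluates the right-hand side of Theorem 2.4 under the simplifying assumption that β(m) = r^m exactly and that ε1 and ε2 are supplied as fixed numbers; in practice they depend on the unknown expected L1 errors of the histograms. The names `bound` and `best_block_length` are hypothetical, and μ is taken as the integer part of n/(2m).

```python
import math

def bound(n, m, r, eps1, eps2):
    """Right-hand side of the Theorem 2.4 bound, assuming beta(m) = r**m."""
    mu = n // (2 * m)                          # number of odd blocks
    concentration = (2 * math.exp(-mu * eps1 ** 2 / 2)
                     + 2 * math.exp(-mu * eps2 ** 2 / 2))
    mixing = 4 * (mu - 1) * r ** m
    return concentration + mixing

def best_block_length(n, r, eps1, eps2):
    """Block length m minimizing the bound over all admissible integers."""
    return min(range(1, n // 2 + 1), key=lambda m: bound(n, m, r, eps1, eps2))

if __name__ == "__main__":
    n, r, eps = 10_000, 0.5, 0.2
    m_star = best_block_length(n, r, eps, eps)
    print("best m:", m_star, "bound:", bound(n, m_star, r, eps, eps))
    print("m = 1 bound:", bound(n, 1, r, eps, eps))
```

Short blocks leave the mixing term large, while long blocks leave too few blocks for the exponential terms to be small, so the minimizing block length sits in between.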

Proving Theorem 2.4 requires showing the L1 convergence of the histogram density estimator with β-mixing data. We do this in the next section.

3 L1 convergence of histograms

Convergence of density estimators is thoroughly studied in the statistics and machine learning literature. Early papers on the L∞ convergence of kernel density estimators (KDEs) include [26, 2, 22]; Freedman and Diaconis [10] look specifically at histogram estimators, and Yu [27] considers the L∞ convergence of KDEs for β-mixing data and shows that the optimal IID rates can be attained. Devroye and Györfi [8] argue that L1 is a more appropriate metric for studying density estimation, and Tran [23] proves L1 consistency of KDEs under α- and β-mixing. As far as we are aware, ours is the first proof of L1 convergence for histograms under β-mixing.

Additionally, the dimensionality of the target density is analogous to the order of the Markov approximation. Therefore, the convergence rates we give are asymptotic in the bandwidth hn which shrinks as n increases, but also in the dimension d which increases with n. Even under these asymptotics, histogram estimation in this sense is not a high dimensional problem. The dimension of the target density considered here is on the order of exp{W(log n)}, a rate somewhere between log n and log log n.

Theorem 3.1

If f̂ is the histogram estimator based on a (possibly vector valued) sample $X_1^n$ from a β-mixing sequence with stationary density f, then for all $\varepsilon > \mathbb{E}\left[\int|\hat{f} - f|\right]$,

$$\mathbb{P}\left(\int|\hat{f} - f| > \varepsilon\right) \le 2\exp\left\{-\frac{\mu_n \varepsilon_1^2}{2}\right\} + 2(\mu_n - 1)\beta(m_n) \tag{5}$$

where $\varepsilon_1 = \varepsilon - \mathbb{E}\left[\int|\hat{f} - f|\right]$.

To prove this result, we use the blocking method of Yu [28] to transform the dependent β-mixing sequence into a sequence of nearly independent blocks. We then apply McDiarmid’s inequality to the blocks to derive asymptotics in the bandwidth of the histogram as well as the dimension of the target density. For completeness, we state Yu’s blocking result and McDiarmid’s inequality before proving the doubly asymptotic histogram convergence for IID data. Combining these lemmas allows us to derive rates of convergence for histograms based on β-mixing inputs.

Lemma 3.2

(Lemma 4.1 in Yu [28]). Let ϕ be a measurable function with respect to the block sequence U uniformly bounded by M. Then,

$$\left|\mathbb{E}[\phi] - \tilde{\mathbb{E}}[\phi]\right| \le M\beta(m_n)(\mu_n - 1), \tag{6}$$

where the first expectation is with respect to the dependent block sequence, U, and 𝔼̃ is with respect to the independent sequence, U′.

This lemma essentially gives a method of applying IID results to β-mixing data. Because the dependence decays as we increase the separation between blocks, widely spaced blocks are nearly independent of each other. In particular, the difference between expectations over these nearly independent blocks and expectations over blocks which are actually independent can be controlled by the β-mixing coefficient.

Lemma 3.3

(McDiarmid Inequality [14]). Let X1,…, Xn be independent random variables, with Xi taking values in a set Ai for each i. Suppose that the measurable function f : ∏ Ai → ℝ satisfies

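$$|f(x) - f(x')| \le c_i$$

whenever the vectors x and x′ differ only in the ith coordinate. Then for any ε > 0,

$$\mathbb{P}(f - \mathbb{E}f > \varepsilon) \le \exp\left\{-\frac{2\varepsilon^2}{\sum_i c_i^2}\right\}.$$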

Lemma 3.4

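For an IID sample X1,…, Xn from some density f on $\mathbb{R}^d$,

$$\int \mathbb{E}\left|\hat{f} - \mathbb{E}\hat{f}\right| dx = O\left(1/\sqrt{n h_n^d}\right) \tag{7}$$
$$\int \left|\mathbb{E}\hat{f} - f\right| dx = O(d h_n) + O(d^2 h_n^2), \tag{8}$$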

where f̂ is the histogram estimate using a grid with sides of length hn.
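A rough numerical illustration of Lemma 3.4 in one dimension is sketched below: it approximates the L1 distance between a histogram estimate and a known density on a fine grid. The function name `histogram_l1_error`, the standard normal target, the evaluation grid on [−5, 5], and the bandwidth choice h = n^(−1/3) are all assumptions made for this example.

```python
import numpy as np

def histogram_l1_error(sample, h, density, grid):
    """Approximate integral of |f_hat - f| over the range covered by `grid`."""
    edges = np.arange(grid[0], grid[-1] + h, h)          # bins with side length h
    counts, _ = np.histogram(sample, bins=edges)
    heights = counts / (len(sample) * h)                 # histogram density values
    # Index of the bin containing each grid point.
    idx = np.clip(np.searchsorted(edges, grid, side="right") - 1,
                  0, len(heights) - 1)
    dx = grid[1] - grid[0]
    return float(np.sum(np.abs(heights[idx] - density(grid))) * dx)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    grid = np.linspace(-5.0, 5.0, 20001)
    phi = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
    for n in (200, 2000, 20000):
        sample = rng.normal(size=n)
        print(n, histogram_l1_error(sample, n ** (-1 / 3), phi, grid))
```

With the bandwidth shrinking and n·h growing, both the variance term (7) and the bias term (8) shrink, so the printed errors should decrease with n.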

Proof of Lemma 3.4

Let pj be the probability of falling into the jth bin Bj. Then,

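$$\begin{aligned}
\int \mathbb{E}\left|\hat{f} - \mathbb{E}\hat{f}\right| &= h_n^d \sum_{j=1}^{J} \mathbb{E}\left|\frac{1}{n h_n^d}\sum_{i=1}^{n} I(X_i \in B_j) - \frac{p_j}{h_n^d}\right| \\
&\le h_n^d \sum_{j=1}^{J} \frac{1}{n h_n^d}\sqrt{\mathbb{V}\left[\sum_{i=1}^{n} I(X_i \in B_j)\right]} \\
&= h_n^d \sum_{j=1}^{J} \frac{1}{n h_n^d}\sqrt{n p_j(1 - p_j)} \\
&= \frac{1}{\sqrt{n}} \sum_{j=1}^{J} \sqrt{p_j(1 - p_j)} \\
&= O(n^{-1/2})\, O(h_n^{-d/2}) = O\left(1/\sqrt{n h_n^d}\right).
\end{aligned}$$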

For the second claim, consider the bin Bj centered at c. Let I be the union of all bins Bj. Assume the following:

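  1. $f \in L_2$ and f is absolutely continuous on I, with a.e. partial derivatives $f_i = \frac{\partial}{\partial y_i} f(y)$

  2. $f_i \in L_2$ and $f_i$ is absolutely continuous on I, with a.e. partial derivatives $f_{ik} = \frac{\partial}{\partial y_k} f_i(y)$

  3. $f_{ik} \in L_2$ for all i, k.

Using a Taylor expansion

$$f(x) = f(c) + \sum_{i=1}^{d} (x_i - c_i) f_i(c) + O(d^2 h_n^2),$$

where $f_i(y) = \frac{\partial}{\partial y_i} f(y)$. Therefore, $p_j$ is given by

$$p_j = \int_{B_j} f(x)\,dx = h_n^d f(c) + O(d^2 h_n^{d+2})$$

since the integral of the second term over the bin is zero. This means that for the jth bin,

$$\mathbb{E}\hat{f}_n(x) - f(x) = \frac{p_j}{h_n^d} - f(x) = -\sum_{i=1}^{d} (x_i - c_i) f_i(c) + O(d^2 h_n^2).$$

Therefore,

$$\begin{aligned}
\int_{B_j} \left|\mathbb{E}\hat{f}_n(x) - f(x)\right| &= \int_{B_j} \left|-\sum_{i=1}^{d} (x_i - c_i) f_i(c) + O(d^2 h_n^2)\right| \\
&\le \int_{B_j} \left|\sum_{i=1}^{d} (x_i - c_i) f_i(c)\right| + \int_{B_j} O(d^2 h_n^2) \\
&= \int_{B_j} \left|\sum_{i=1}^{d} (x_i - c_i) f_i(c)\right| + O(d^2 h_n^{2+d}) \\
&= O(d h_n^{d+1}) + O(d^2 h_n^{2+d}).
\end{aligned}$$

Since each bin is bounded, we can sum over all J bins. The number of bins is $J = h_n^{-d}$ by definition, so

$$\int \left|\mathbb{E}\hat{f}_n(x) - f(x)\right| dx = O(h_n^{-d})\left(O(d h_n^{d+1}) + O(d^2 h_n^{2+d})\right) = O(d h_n) + O(d^2 h_n^2).$$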

We can now prove the main result of this section.

Proof of Theorem 3.1

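Let g be the L1 loss of the histogram estimator, $g = \int|f - \hat{f}_n|$. Here $\hat{f}_n(x) = \frac{1}{n h_n^d}\sum_{i=1}^{n} I(X_i \in B_j(x))$ where $B_j(x)$ is the bin containing x. Let $\hat{f}_U$, $\hat{f}_V$, and $\hat{f}_{U'}$ be histograms based on the block sequences U, V, and U′ respectively. Clearly $\hat{f}_n = \frac{1}{2}(\hat{f}_U + \hat{f}_V)$. Now,

$$\begin{aligned}
\mathbb{P}(g > \varepsilon) &= \mathbb{P}\left(\int|f - \hat{f}_n| > \varepsilon\right) \\
&= \mathbb{P}\left(\int\left|\frac{f - \hat{f}_U}{2} + \frac{f - \hat{f}_V}{2}\right| > \varepsilon\right) \\
&\le \mathbb{P}\left(\frac{1}{2}\int|f - \hat{f}_U| + \frac{1}{2}\int|f - \hat{f}_V| > \varepsilon\right) \\
&= \mathbb{P}(g_U + g_V > 2\varepsilon) \\
&\le \mathbb{P}(g_U > \varepsilon) + \mathbb{P}(g_V > \varepsilon) \\
&= 2\,\mathbb{P}\left(g_U - \mathbb{E}[g_U] > \varepsilon - \mathbb{E}[g_U]\right) \\
&= 2\,\mathbb{P}\left(g_U - \mathbb{E}[g_U] > \varepsilon_1\right),
\end{aligned}$$

where $\varepsilon_1 = \varepsilon - \mathbb{E}[g_U]$. Here,

$$\mathbb{E}[g_U] \le \tilde{\mathbb{E}}\int\left|\hat{f}_{U'} - \tilde{\mathbb{E}}\hat{f}_{U'}\right| dx + \int\left|\tilde{\mathbb{E}}\hat{f}_{U'} - f\right| dx,$$

so by Lemma 3.4, as long as $\mu_n \to \infty$, $h_n \downarrow 0$ and $\mu_n h_n^d \to \infty$, for all ε there exists $n_0(\varepsilon)$ such that for all $n > n_0(\varepsilon)$, $\varepsilon > \mathbb{E}[g] = \mathbb{E}[g_U]$. Now applying Lemma 3.2 to the expectation of the indicator of the event $\{g_U - \mathbb{E}[g_U] > \varepsilon_1\}$ gives

$$2\,\mathbb{P}\left(g_U - \mathbb{E}[g_U] > \varepsilon_1\right) \le 2\,\mathbb{P}\left(g_{U'} - \mathbb{E}[g_{U'}] > \varepsilon_1\right) + 2(\mu_n - 1)\beta(m_n)$$

where the probability on the right is for the σ-field generated by the independent block sequence U′. Since these blocks are independent, showing that $g_{U'}$ satisfies the bounded differences requirement allows for the application of McDiarmid’s inequality (Lemma 3.3) to the blocks. For any two block sequences $u_1, \ldots, u_{\mu_n}$ and $\bar{u}_1, \ldots, \bar{u}_{\mu_n}$ with $u_\ell = \bar{u}_\ell$ for all ℓ ≠ j,

$$\begin{aligned}
\left|g_{U'}(u_1, \ldots, u_{\mu_n}) - g_{U'}(\bar{u}_1, \ldots, \bar{u}_{\mu_n})\right| &= \left|\int\left|\hat{f}(y; u_1, \ldots, u_{\mu_n}) - f(y)\right| dy - \int\left|\hat{f}(y; \bar{u}_1, \ldots, \bar{u}_{\mu_n}) - f(y)\right| dy\right| \\
&\le \int\left|\hat{f}(y; u_1, \ldots, u_{\mu_n}) - \hat{f}(y; \bar{u}_1, \ldots, \bar{u}_{\mu_n})\right| dy \\
&= \frac{2}{\mu_n h_n^d} h_n^d = \frac{2}{\mu_n}.
\end{aligned}$$

Therefore,

$$\mathbb{P}(g > \varepsilon) \le 2\,\mathbb{P}\left(g_{U'} - \mathbb{E}[g_{U'}] > \varepsilon_1\right) + 2(\mu_n - 1)\beta(m_n) \le 2\exp\left\{-\frac{\mu_n \varepsilon_1^2}{2}\right\} + 2(\mu_n - 1)\beta(m_n).$$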
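The exponent in the final bound can be checked directly. As a short worked step, treat each of the μ_n independent blocks as one coordinate in Lemma 3.3, with bounded-difference constant 2/μ_n taken from the preceding display (the index j below is only a summation label):

```latex
c_j = \frac{2}{\mu_n}, \qquad
\sum_{j=1}^{\mu_n} c_j^2 = \mu_n \cdot \frac{4}{\mu_n^2} = \frac{4}{\mu_n}, \qquad
\exp\left\{-\frac{2\varepsilon_1^2}{\sum_{j} c_j^2}\right\}
  = \exp\left\{-\frac{2\varepsilon_1^2 \mu_n}{4}\right\}
  = \exp\left\{-\frac{\mu_n \varepsilon_1^2}{2}\right\}.
```

This is why the concentration term depends on the number of blocks μ_n rather than on the sample size n.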

4 Proofs

The proof of Theorem 2.4 relies on the triangle inequality and the relationship between total variation distance and the L1 distance between densities.

Proof of Theorem 2.4

For any probability measures ν and λ defined on the same probability space with associated densities fν and fλ with respect to some dominating measure π,

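$$\|\nu - \lambda\|_{TV} = \frac{1}{2}\int|f_\nu - f_\lambda|\,d(\pi).$$

Let P be the d-dimensional stationary distribution of the dth order Markov process, i.e. $P = \mathbb{P}_{t-d+1}^{t} = \mathbb{P}_{t+a}^{t+a+d-1}$ in the notation of equation 3. Let $\mathbb{P}_{a,d}$ be the joint distribution of the bivariate random process created by the initial process and itself separated by a time steps. By the triangle inequality, we can upper bound $\beta^d(a)$ for any $d = d_n$. Let $\hat{P}$ and $\hat{\mathbb{P}}_{a,d}$ be the distributions associated with histogram estimators $\hat{f}^d$ and $\hat{f}_a^{2d}$ respectively. Then,

$$\begin{aligned}
\beta^d(a) &= \left\|P \otimes P - \mathbb{P}_{a,d}\right\|_{TV} \\
&= \left\|P \otimes P - \hat{P} \otimes \hat{P} + \hat{P} \otimes \hat{P} - \hat{\mathbb{P}}_{a,d} + \hat{\mathbb{P}}_{a,d} - \mathbb{P}_{a,d}\right\|_{TV} \\
&\le \left\|P \otimes P - \hat{P} \otimes \hat{P}\right\|_{TV} + \left\|\hat{P} \otimes \hat{P} - \hat{\mathbb{P}}_{a,d}\right\|_{TV} + \left\|\hat{\mathbb{P}}_{a,d} - \mathbb{P}_{a,d}\right\|_{TV} \\
&\le 2\left\|P - \hat{P}\right\|_{TV} + \left\|\hat{P} \otimes \hat{P} - \hat{\mathbb{P}}_{a,d}\right\|_{TV} + \left\|\hat{\mathbb{P}}_{a,d} - \mathbb{P}_{a,d}\right\|_{TV} \\
&= \int\left|f^d - \hat{f}^d\right| + \frac{1}{2}\int\left|\hat{f}^d \otimes \hat{f}^d - \hat{f}_a^{2d}\right| + \frac{1}{2}\int\left|f_a^{2d} - \hat{f}_a^{2d}\right|
\end{aligned}$$

where $\frac{1}{2}\int|\hat{f}^d \otimes \hat{f}^d - \hat{f}_a^{2d}|$ is our estimator $\hat{\beta}^d(a)$ and the remaining terms are the L1 distance between a density estimator and the target density. Thus,

$$\beta^d(a) - \hat{\beta}^d(a) \le \int\left|f^d - \hat{f}^d\right| + \frac{1}{2}\int\left|f_a^{2d} - \hat{f}_a^{2d}\right|.$$

A similar argument starting from $\beta^d(a) = \|P \otimes P - \mathbb{P}_{a,d}\|_{TV}$ shows that

$$\beta^d(a) - \hat{\beta}^d(a) \ge -\int\left|f^d - \hat{f}^d\right| - \frac{1}{2}\int\left|f_a^{2d} - \hat{f}_a^{2d}\right|,$$

so we have that

$$\left|\beta^d(a) - \hat{\beta}^d(a)\right| \le \int\left|f^d - \hat{f}^d\right| + \frac{1}{2}\int\left|f_a^{2d} - \hat{f}_a^{2d}\right|.$$

Therefore,

$$\begin{aligned}
\mathbb{P}\left(\left|\beta^d(a) - \hat{\beta}^d(a)\right| > \varepsilon\right) &\le \mathbb{P}\left(\int\left|f^d - \hat{f}^d\right| + \frac{1}{2}\int\left|f_a^{2d} - \hat{f}_a^{2d}\right| > \varepsilon\right) \\
&\le \mathbb{P}\left(\int\left|f^d - \hat{f}^d\right| > \frac{\varepsilon}{2}\right) + \mathbb{P}\left(\frac{1}{2}\int\left|f_a^{2d} - \hat{f}_a^{2d}\right| > \frac{\varepsilon}{2}\right) \\
&\le 2\exp\left\{-\frac{\mu_n \varepsilon_1^2}{2}\right\} + 2\exp\left\{-\frac{\mu_n \varepsilon_2^2}{2}\right\} + 4(\mu_n - 1)\beta(m_n),
\end{aligned}$$

where $\varepsilon_1 = \varepsilon/2 - \mathbb{E}\left[\int|\hat{f}^d - f^d|\right]$ and $\varepsilon_2 = \varepsilon - \mathbb{E}\left[\int|\hat{f}_a^{2d} - f_a^{2d}|\right]$.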

The proof of Theorem 2.3 requires two steps which are given in the following Lemmas. The first specifies the histogram bandwidth hn and the rate at which dn (the dimensionality of the target density) goes to infinity. If the dimensionality of the target density were fixed, we could achieve rates of convergence similar to those for histograms based on IID inputs. However, we wish to allow the dimensionality to grow with n, so the rates are much slower as shown in the following lemma.

Lemma 4.1

For the histogram estimator in Lemma 3.4, let

$$d_n \sim \exp\{W(\log n)\},$$
$$h_n \sim n^{-k_n},$$

with

$$k_n = \frac{W(\log n) + \frac{1}{2}\log n}{\log n\left(\frac{1}{2}\exp\{W(\log n)\} + 1\right)}.$$

These choices lead to the optimal rate of convergence.

Proof

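Let $h_n = n^{-k_n}$ for some $k_n$ to be determined. Then we want $n^{-1/2}h_n^{-d_n/2} = n^{(k_n d_n - 1)/2} \to 0$, $d_n h_n = d_n n^{-k_n} \to 0$, and $d_n^2 h_n^2 = d_n^2 n^{-2k_n} \to 0$ all as n → ∞. Call these A, B, and C. Taking A and B first gives

$$\begin{aligned}
n^{(k_n d_n - 1)/2} &\sim d_n n^{-k_n} \\
\Rightarrow \tfrac{1}{2}(k_n d_n - 1)\log n &\sim \log d_n - k_n \log n \\
\Rightarrow k_n \log n\left(\tfrac{1}{2} d_n + 1\right) &\sim \log d_n + \tfrac{1}{2}\log n \\
\Rightarrow k_n &\sim \frac{\log d_n + \frac{1}{2}\log n}{\log n\left(\frac{1}{2} d_n + 1\right)}.
\end{aligned} \tag{9}$$

Similarly, combining A and C gives

$$k_n \sim \frac{2\log d_n + \frac{1}{2}\log n}{\log n\left(\frac{1}{2} d_n + 2\right)}. \tag{10}$$

Equating (9) and (10) and solving for $d_n$ gives

$$d_n \sim \exp\{W(\log n)\}$$

where W(·) is the Lambert W function. Plugging back into (9) gives that

$$h_n = n^{-k_n}$$

where

$$k_n = \frac{W(\log n) + \frac{1}{2}\log n}{\log n\left(\frac{1}{2}\exp\{W(\log n)\} + 1\right)}.$$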
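For readers who want the intermediate algebra, the following sketch shows one way to see how equating (9) and (10) produces the Lambert W function. The shorthand L = log n and the dropped subscript on d are introduced only for this derivation.

```latex
\frac{\log d + \frac{1}{2}L}{\frac{1}{2}d + 1} = \frac{2\log d + \frac{1}{2}L}{\frac{1}{2}d + 2}
\;\Longrightarrow\;
\left(\log d + \tfrac{1}{2}L\right)\left(\tfrac{1}{2}d + 2\right)
  = \left(2\log d + \tfrac{1}{2}L\right)\left(\tfrac{1}{2}d + 1\right)
\\[4pt]
\tfrac{1}{2}d\log d + 2\log d + \tfrac{1}{4}dL + L
  = d\log d + 2\log d + \tfrac{1}{4}dL + \tfrac{1}{2}L
\;\Longrightarrow\;
d\log d = L = \log n
\\[4pt]
w := \log d \;\Longrightarrow\; w e^{w} = \log n
\;\Longrightarrow\; w = W(\log n)
\;\Longrightarrow\; d = \exp\{W(\log n)\}.
```

The last line uses only the definition of W as the inverse of w ↦ w exp{w}.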

It is also necessary to show that as d grows, βd(a) → β(a). We now prove this result.

Lemma 4.2

βd(a) converges to β(a) as d → ∞.

Proof

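By stationarity, the supremum over t is unnecessary in Definition 2.1, so without loss of generality, let t = 0. Let $\mathbb{P}_{-\infty}^{0}$ be the distribution on $\sigma_{-\infty}^{0} = \sigma(\ldots, X_{-1}, X_0)$, and let $\mathbb{P}_{a+1}^{\infty}$ be the distribution on $\sigma_{a+1}^{\infty} = \sigma(X_{a+1}, X_{a+2}, \ldots)$. Let $\mathbb{P}_a$ be the distribution on $\sigma = \sigma_{-\infty}^{0} \otimes \sigma_{a+1}^{\infty}$ (the product sigma-field). Then we can rewrite Definition 2.1 using this notation as

$$\beta(a) = \sup_{C \in \sigma}\left|\mathbb{P}_a(C) - \left[\mathbb{P}_{-\infty}^{0} \otimes \mathbb{P}_{a+1}^{\infty}\right](C)\right|.$$

Let $\sigma_{-d+1}^{0}$ and $\sigma_{a+1}^{a+d}$ be the sub-σ-fields of $\sigma_{-\infty}^{0}$ and $\sigma_{a+1}^{\infty}$ consisting of the d-dimensional cylinder sets for the d dimensions closest together. Let $\sigma^d$ be the product σ-field of these two. Then we can rewrite $\beta^d(a)$ as

$$\beta^d(a) = \sup_{C \in \sigma^d}\left|\mathbb{P}_a(C) - \left[\mathbb{P}_{-\infty}^{0} \otimes \mathbb{P}_{a+1}^{\infty}\right](C)\right|. \tag{11}$$

As such $\beta^d(a) \le \beta(a)$ for all a and d. We can rewrite (11) in terms of finite-dimensional marginals:

$$\beta^d(a) = \sup_{C \in \sigma^d}\left|\mathbb{P}_{a,d}(C) - \left[\mathbb{P}_{-d+1}^{0} \otimes \mathbb{P}_{a+1}^{a+d}\right](C)\right|,$$

where $\mathbb{P}_{a,d}$ is the restriction of ℙ to $\sigma(X_{-d+1}, \ldots, X_0, X_{a+1}, \ldots, X_{a+d})$. Because of the nested nature of these sigma-fields, we have

$$\beta^{d_1}(a) \le \beta^{d_2}(a) \le \beta(a)$$

for all finite $d_1 \le d_2$. Therefore, for fixed a, $\{\beta^d(a)\}_{d=1}^{\infty}$ is a monotone increasing sequence which is bounded above, and it converges to some limit L ≤ β(a). To show that L = β(a) requires some additional steps.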

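Let $R = \mathbb{P}_a - [\mathbb{P}_{-\infty}^{0} \otimes \mathbb{P}_{a+1}^{\infty}]$, which is a signed measure on σ. Let $R^d = \mathbb{P}_{a,d} - [\mathbb{P}_{-d+1}^{0} \otimes \mathbb{P}_{a+1}^{a+d}]$, which is a signed measure on $\sigma^d$. Decompose R into positive and negative parts as $R = Q^+ - Q^-$ and similarly for $R^d = Q^{+d} - Q^{-d}$. Notice that since $R^d$ is constructed using the marginals of ℙ, then $R(E) = R^d(E)$ for all $E \in \sigma^d$. Now since R is the difference of probability measures, we must have that

$$0 = R(\Omega) = Q^+(\Omega) - Q^-(\Omega) = Q^+(D) + Q^+(D^c) - Q^-(D) - Q^-(D^c) \tag{12}$$

for all D ∈ σ.

Define $Q = Q^+ + Q^-$. Let ε > 0. Let C ∈ σ be such that

$$Q(C) = \beta(a) = Q^+(C) = Q^-(C^c). \tag{13}$$

Such a set C is guaranteed by the Hahn decomposition theorem (letting C* be a set which attains the supremum in (11), we can throw away any subsets with negative R measure) and (12) assuming without loss of generality that $\mathbb{P}_a(C) > [\mathbb{P}_{-\infty}^{0} \otimes \mathbb{P}_{a+1}^{\infty}](C)$. We can use the field $\sigma_f = \cup_d\, \sigma^d$ to approximate σ in the sense that, for all ε, we can find $A \in \sigma_f$ such that $Q(A \Delta C) < \varepsilon/2$ (see Theorem D in Halmos [11, §13] or Lemma A.24 in Schervish [21]). Now,

$$Q(A \Delta C) = Q(A \cap C^c) + Q(C \cap A^c) = Q^-(A \cap C^c) + Q^+(C \cap A^c)$$

by (13) since $A \cap C^c \subseteq C^c$ and $C \cap A^c \subseteq C$. Therefore, since $Q(A \Delta C) < \varepsilon/2$, we have

$$Q^-(A \cap C^c) \le \varepsilon/2 \tag{14}$$
$$Q^+(A^c \cap C) \le \varepsilon/2.$$

Also,

$$Q(C) = Q(A \cap C) + Q(A^c \cap C) = Q^+(A \cap C) + Q^+(A^c \cap C) \le Q^+(A) + \varepsilon/2$$

since $A \cap C$ and $A^c \cap C$ are contained in C and $A \cap C \subseteq A$. Therefore

$$Q^+(A) \ge Q(C) - \varepsilon/2.$$

Similarly,

$$Q^-(A) = Q^-(A \cap C) + Q^-(A \cap C^c) \le 0 + \varepsilon/2 = \varepsilon/2$$

since $A \cap C \subseteq C$ and $Q^-(C) = 0$ by (13), while $Q^-(A \cap C^c) \le \varepsilon/2$ by (14). Finally,

$$\begin{aligned}
Q^{+d}(A) &\ge Q^{+d}(A) - Q^{-d}(A) = R^d(A) = R(A) = Q^+(A) - Q^-(A) \\
&\ge Q(C) - \varepsilon/2 - \varepsilon/2 = Q(C) - \varepsilon = \beta(a) - \varepsilon.
\end{aligned}$$

And since $\beta^d(a) \ge Q^{+d}(A)$, we have that for all ε > 0 there exists d such that for all $d_1 > d$,

$$\beta^{d_1}(a) \ge \beta^d(a) \ge Q^{+d}(A) \ge \beta(a) - \varepsilon.$$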

Thus, we must have that L = β(a), so that βd(a) → β(a) as desired.

Proof of Theorem 2.3

By the triangle inequality,

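$$\left|\hat{\beta}^{d_n}(a) - \beta(a)\right| \le \left|\hat{\beta}^{d_n}(a) - \beta^{d_n}(a)\right| + \left|\beta^{d_n}(a) - \beta(a)\right|.$$

The first term on the right is bounded by the result in Theorem 2.4, where we have shown that $d_n = O(\exp\{W(\log n)\})$ is slow enough for the histogram estimator to remain consistent. That $\beta^{d_n}(a) \to \beta(a)$ as $d_n \to \infty$ follows from Lemma 4.2.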

5 Discussion

We have shown that our estimator of the β-mixing coefficients is consistent for the true coefficients β(a) under some conditions on the data generating process. There are numerous results in the statistics and machine learning literatures which assume knowledge of the β-mixing coefficients, yet as far as we know, this is the first estimator for them. An ability to estimate these coefficients will allow researchers to apply existing results to dependent data without the need to arbitrarily assume their values. Despite the obvious utility of this estimator, as a consequence of its novelty, it comes with a number of potential extensions which warrant careful exploration as well as some drawbacks.

The reader will note that Theorem 2.3 does not provide a convergence rate. The rate in Theorem 2.4 applies only to the difference between β̂d(a) and βd(a). In order to provide a rate in Theorem 2.3, we would need a better understanding of the non-stochastic convergence of βd(a) to β(a). It is not immediately clear that this quantity can converge at any well-defined rate. In particular, it seems likely that the rate of convergence depends on the tail of the sequence $\{\beta(a)\}_{a=1}^{\infty}$.

Several other mixing and weak-dependence coefficients also have a total-variation flavor, perhaps most notably α-mixing [9, 7, 4]. None of them have estimators, and the same trick might well work for them, too.

The use of histograms rather than kernel density estimators for the joint and marginal densities is surprising and perhaps not ultimately necessary. As mentioned above, Tran [23] proved that KDEs are consistent for estimating the stationary density of a time series with β-mixing inputs, so perhaps one could replace the histograms in our estimator with KDEs. However, this would need an analogue of the double asymptotic results proven for histograms in Lemma 3.4. In particular, we need to estimate increasingly higher dimensional densities as n → ∞. This does not cause a problem of small-n-large-d since d is chosen as a function of n, however it will lead to increasingly higher dimensional integration. For histograms, the integral is always trivial, but in the case of KDEs, the numerical accuracy of the integration algorithm becomes increasingly important. This issue could swamp any efficiency gains obtained through the use of kernels. However, this question certainly warrants further investigation.

The main drawback of an estimator based on a density estimate is its complexity. The mixing coefficients are functionals of the joint and marginal distributions derived from the stochastic process X, however, it is unsatisfying to estimate densities and solve integrals in order to estimate a single number. Vapnik’s main principle for solving problems using a restricted amount of information is

When solving a given problem, try to avoid solving a more general problem as an intermediate step [24, p. 30].

This principle is clearly violated here, but perhaps our seed will precipitate a more aesthetically pleasing solution.

Acknowledgements

The authors wish to thank Darren Homrighausen and two anonymous reviewers for helpful comments and the Institute for New Economic Thinking for supporting this research.

Footnotes

Appearing in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Fort Lauderdale, FL, USA.

1

The literature on algorithmic stability refers to this as β-stability (e.g. Bousquet and Elisseeff [3]).

2

While it is clearly possible to replace histograms with other choices of density estimators (most notably kernel density estimators), histograms in this case are more convenient theoretically and computationally. See §5 for more details.

3

The Lambert W function is defined as the (multivalued) inverse of f(w) = w exp{w}. Thus, O(exp{W(log n)}) is bigger than O(log log n) but smaller than O(log n). See for example Corless et al. [6].

Contributor Information

Daniel J. McDonald, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213, danielmc@stat.cmu.edu

Cosma Rohilla Shalizi, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213, cshalizi@stat.cmu.edu

Mark Schervish, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213, mark@cmu.edu

References

  1. Baraud Y, Comte F, Viennet G. Adaptive Estimation in Autoregression or β-mixing Regression via Model Selection. Annals of Statistics. 2001;29:839–875.
  2. Bickel P, Rosenblatt M. On Some Global Measures of the Deviations of Density Function Estimates. The Annals of Statistics. 1973;1:1071–1095.
  3. Bousquet O, Elisseeff A. Stability and Generalization. The Journal of Machine Learning Research. 2002;2:499–526.
  4. Bradley RC. Basic Properties of Strong Mixing Conditions. A Survey and Some Open Questions. Probability Surveys. 2005;2:107–144.
  5. Carrasco M, Chen X. Mixing and Moment Properties of Various GARCH and Stochastic Volatility Models. Econometric Theory. 2002;18:17–39.
  6. Corless R, Gonnet G, Hare D, Jeffrey D, Knuth D. On the Lambert W Function. Advances in Computational Mathematics. 1996;5:329–359.
  7. Dedecker J, Doukhan P, Lang G, Leon R JR, Louhichi S, Prieur C. Weak Dependence: With Examples and Applications, vol. 190 of Lecture Notes in Statistics. New York: Springer Verlag; 2007.
  8. Devroye L, Györfi L. Nonparametric Density Estimation: The L1 View. New York: Wiley; 1985.
  9. Doukhan P. Mixing: Properties and Examples, vol. 85 of Lecture Notes in Statistics. New York: Springer Verlag; 1994.
  10. Freedman D, Diaconis P. On the Maximum Deviation Between the Histogram and the Underlying Density. Probability Theory and Related Fields. 1981;58:139–167.
  11. Halmos P. Measure Theory, Graduate Texts in Mathematics. New York: Springer-Verlag; 1974.
  12. Karandikar RL, Vidyasagar M. Probably Approximately Correct Learning with Beta-Mixing Input Sequences. 2009. Submitted for publication.
  13. Lozano A, Kulkarni S, Schapire R. Convergence and Consistency of Regularized Boosting Algorithms with Stationary Beta-Mixing Observations. Advances in Neural Information Processing Systems. 2006;18:819.
  14. McDiarmid C. On the Method of Bounded Differences. In: Siemons J, editor. Surveys in Combinatorics, vol. 141 of London Mathematical Society Lecture Note Series. Cambridge University Press; 1989. pp. 148–188.
  15. Meir R. Nonparametric Time Series Prediction Through Adaptive Model Selection. Machine Learning. 2000;39:5–34.
  16. Mohri M, Rostamizadeh A. Stability Bounds for Stationary φ-mixing and β-mixing Processes. Journal of Machine Learning Research. 2010;11:789–814.
  17. Mokkadem A. Mixing Properties of ARMA Processes. Stochastic Processes and Their Applications. 1988;29:309–315.
  18. Nobel A. Hypothesis Testing for Families of Ergodic Processes. Bernoulli. 2006;12:251–269.
  19. Nummelin E, Tuominen P. Geometric Ergodicity of Harris Recurrent Markov Chains with Applications to Renewal Theory. Stochastic Processes and Their Applications. 1982;12:187–202.
  20. Ralaivola L, Szafranski M, Stempfel G. Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes. Journal of Machine Learning Research. 2010;11:1927–1956.
  21. Schervish M. Theory of Statistics, Springer Series in Statistics. New York: Springer Verlag; 1995.
  22. Silverman B. Weak and Strong Uniform Consistency of the Kernel Estimate of a Density and its Derivatives. The Annals of Statistics. 1978;6:177–184.
  23. Tran L. The L1 Convergence of Kernel Density Estimates under Dependence. The Canadian Journal of Statistics/La Revue Canadienne de Statistique. 1989;17:197–208.
  24. Vapnik V. The Nature of Statistical Learning Theory. Statistics for Engineering and Information Science. 2nd edn. New York: Springer Verlag; 2000.
  25. Vidyasagar M. A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems. Berlin: Springer Verlag; 1997.
  26. Woodroofe M. On the Maximum Deviation of the Sample Density. The Annals of Mathematical Statistics. 1967;38:475–481.
  27. Yu B. Density Estimation in the L∞ Norm for Dependent Data with Applications to the Gibbs Sampler. Annals of Statistics. 1993;21:711–735.
  28. Yu B. Rates of Convergence for Empirical Processes of Stationary Mixing Sequences. The Annals of Probability. 1994;22:94–116.
