Published in final edited form as: J Am Stat Assoc. 2018 Jun 19;113(524):1698–1709. doi: 10.1080/01621459.2017.1360779

A Massive Data Framework for M-Estimators with Cubic-Rate

Chengchun Shi, Wenbin Lu, and Rui Song

Abstract

The divide and conquer method is a common strategy for handling massive data. In this article, we study the divide and conquer method for cubic-rate estimators under the massive data framework. We develop a general theory for establishing the asymptotic distribution of the aggregated M-estimators using a weighted average with weights depending on the subgroup sample sizes. Under certain conditions on the growth rate of the number of subgroups, the resulting aggregated estimators are shown to have a faster convergence rate and an asymptotically normal distribution, which make them more tractable in both computation and inference than the original M-estimators based on pooled data. Our theory applies to a wide class of M-estimators with cube root convergence rate, including the location estimator, the maximum score estimator and the value search estimator. Simulations and a real data application further validate our theoretical findings.

Keywords: Cubic rate asymptotics, divide and conquer, M-estimators, massive data

1 Introduction

In a world of explosively large data, effective estimation procedures are needed to address the computational challenges arising from the analysis of massive data. The divide and conquer method is a commonly used approach for handling massive data: it divides the data into several groups, computes an estimator within each group, and aggregates the subgroup estimators, often by a simple average, to lessen the computational burden. A number of problems have been studied for the divide and conquer method, including variable selection (Chen and Xie, 2014), nonparametric regression (Zhang et al., 2013; Zhao et al., 2016) and bootstrap inference (Kleiner et al., 2014), to mention a few. Most papers establish that the aggregated estimators achieve the oracle result, in the sense that they possess the same nonasymptotic error bounds or limiting distributions as the pooled estimators, which are obtained by fitting a single model to all the data. This implies that the divide and conquer scheme not only maintains efficiency, but also yields a feasible solution for analyzing massive data.

In addition to its computational advantages for handling massive data, the divide and conquer method, somewhat surprisingly, can lead to aggregated estimators with improved efficiency over pooled estimators whose convergence rate is slower than the usual $n^{1/2}$. A recent independent work of Banerjee et al. (2016) studied the divide and conquer principle in the monotone regression setting, where the estimator converges at an $n^{1/3}$ rate. In particular, they showed that the aggregated estimator obtained by averaging all subgroup estimators converges much faster than the pooled estimator based on all observations and is asymptotically normal. This phenomenon is expected to hold in many other cube-root estimation problems. For example, Chernoff (1964) studied a cubic-rate estimator of the mode. It was shown therein that the estimator converges in distribution to the argmax of a Brownian motion minus a quadratic drift. Kim and Pollard (1990) systematically studied a class of cubic-rate M-estimators and established their limiting distributions as the argmax of a general Gaussian process minus a quadratic form. These results were extended to a more general class of M-estimators using modern empirical process results (van der Vaart and Wellner, 1996; Kosorok, 2008). In this paper, we study a class of M-estimators with cubic rate and develop a general inference framework for the aggregated estimators obtained by the divide and conquer method. Our theory states that the aggregated estimators can achieve a faster convergence rate than the pooled estimators and have asymptotically normal distributions when the number of groups diverges at a proper rate as the sample size of each group grows. This also yields a simple way of estimating the covariance matrix of the aggregated estimators.

When establishing the asymptotic properties of the aggregated estimators, a major technical challenge is to quantify the accumulated bias. Unlike estimators with the standard $n^{1/2}$ convergence rate, M-estimators with $n^{1/3}$ convergence rate generally do not have a nice linearization representation, and the magnitude of the associated biases is difficult to quantify. One way to obtain the magnitude of the bias is by establishing a coupling inequality for the cubic-rate estimator. For example, Banerjee et al. (2016) derived a nonasymptotic bound for the biases of the isotonic estimator in a monotone regression model and its inverse, based on the coupling inequality of the isotonic estimator (see Lemma 8.10 in that paper, and also Equation (29) in Durot (2002)). Groeneboom et al. (1999) provided a coupling inequality for the inverse process of the Grenander estimator, and their results can be used to establish the bias of the Grenander estimator. While such a strategy is useful for studying the bias of some one-dimensional cubic-rate estimators, it is not suitable for multi-dimensional estimators. On the one hand, these coupling inequalities are all based on the Komlós–Major–Tusnády (KMT) approximation (Komlós et al., 1975) and its extensions (cf. Csörgő et al., 1985; Sakhanenko, 2006), which only apply to the empirical distribution or the quantile process. There are extensions of the KMT approximation to more general empirical processes (cf. Rio, 1994; Koltchinskii, 1994); however, the rate of the approximation depends on the dimension of the parameter and deteriorates quickly as the dimension increases. On the other hand, proofs of these coupling inequalities all rely on the properties of the argmax of a Brownian motion process with a parabolic drift (cf. Proposition 1 in Durot (2002) and the discussions therein), and are not applicable to cubic-rate estimators that converge to the argmax of a more general Gaussian process minus a quadratic term. Here, we propose a novel approach to derive an upper bound for the bias without establishing coupling inequalities. To the best of our knowledge, this is the first time that a nonasymptotic error bound for the bias of a general cubic-rate estimator has been provided.

A key innovation in our analysis is to introduce a linear perturbation in the empirical objective function. In that way, we transform the problem of quantifying the bias into a comparison of the expected supremum of the empirical objective function with that of its limiting Gaussian process. To bound the difference of these expected suprema, we adopt techniques similar to those recently studied by Chernozhukov et al. (2013) and Chernozhukov et al. (2014). Specifically, they compared a function of the maximum of a sum of mean-zero Gaussian random vectors with that of multivariate mean-zero random vectors with the same covariance function, and provided an associated coupling inequality. We improve their arguments by providing more accurate approximation results (Lemma A.3) for the identity function of maxima, as needed in our applications.

Another major contribution of this paper is a tail inequality for cubic-rate M-estimators (Theorem 5.1). This allows us to construct a truncated estimator with bounded second moment, which is essential for applying Lyapunov's central limit theorem to establish the normality of the aggregated estimator. Under some additional tail assumptions on the underlying empirical process, our results can be viewed as a generalization of the empirical process theory that establishes consistency and the $n^{1/3}$ convergence rate of M-estimators. Based on these results, we show that the asymptotic variance of the aggregated estimator can be consistently estimated by the sample variance of the individual M-estimators in each group, which largely simplifies the inference procedure for M-estimators.

The rest of the paper is organized as follows. We describe the divide and conquer method for M-estimators and state the main central limit theorem (Theorem 2.1) in Section 2. Three examples, the location estimator, the maximum score estimator and the value search estimator, are presented in Section 3 to illustrate the application of Theorem 2.1. In Section 4, we demonstrate the empirical performance of the aggregated estimators using both simulation studies and an application to the Yahoo! Front Page Today Module user click log dataset. Section 5 establishes a tail inequality that is needed to prove Theorem 2.1, followed by a discussion in Section 6. All the technical proofs are provided in the Appendix.

2 Method

The divide and conquer scheme for M-estimators is described as follows. In the first step, the data are randomly divided into several groups. For the jth group, consider the following M-estimator

$$\hat\theta^{(j)} = \arg\max_{\theta\in\Theta} \mathbb{P}_{n_j}^{(j)} m(\cdot,\theta) \equiv \arg\max_{\theta\in\Theta} \frac{1}{n_j}\sum_{i=1}^{n_j} m(X_i^{(j)},\theta), \qquad j = 1,\dots,S,$$

where $(X_1^{(j)},\dots,X_{n_j}^{(j)})$ denote the data for the jth group, $n_j$ is the number of observations in the jth group, S is the number of groups, m(·, ·) is the objective function and θ is a d-dimensional vector of parameters that belongs to a compact parameter space Θ. In the second step, the aggregated estimator θ̂0 is obtained as a weighted average of all subgroup estimators,

$$\hat\theta_0 = \sum_{j=1}^S \omega_j\, \hat\theta^{(j)} = \frac{\sum_{j=1}^S n_j^{2/3}\,\hat\theta^{(j)}}{\sum_{j=1}^S n_j^{2/3}}. \tag{1}$$

Remark 2.1

The weights ωj's are chosen such that θ̂0 achieves the smallest asymptotic covariance matrix among the class of linearly aggregated estimators $\{\theta_\omega = \sum_j \omega_j \hat\theta^{(j)} : \sum_j \omega_j = 1,\ \omega_j \ge 0,\ \forall j = 1,\dots,S\}$ (see Section F in the supplementary appendix for detailed illustrations). When n1 = n2 = · · · = nS, θ̂0 reduces to a simple average of all θ̂(j)'s.
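
To make the aggregation step in (1) concrete, here is a minimal sketch (our own illustration; the function name is ours) that computes the weighted average of subgroup estimates with weights proportional to $n_j^{2/3}$:

```python
import numpy as np

def aggregate(subgroup_estimates, subgroup_sizes):
    """Aggregate subgroup M-estimators as in (1), with weights
    proportional to n_j^{2/3}.

    subgroup_estimates: array of shape (S, d), one row per group.
    subgroup_sizes:     array of shape (S,), the n_j's.
    """
    theta = np.asarray(subgroup_estimates, dtype=float)
    w = np.asarray(subgroup_sizes, dtype=float) ** (2.0 / 3.0)
    w /= w.sum()          # normalize so the weights sum to one
    return w @ theta      # weighted average across the S groups
```

With equal group sizes the weights reduce to 1/S, recovering the simple average noted in Remark 2.1.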

We assume that all the $X_i^{(j)}$'s are independent and identically distributed across i and j. Here, we only consider M-estimation with functions m(·, θ) that are non-smooth in θ, for which the resulting M-estimators θ̂(j) have a convergence rate of $O_p(n_j^{-1/3})$. Such cubic-rate M-estimators have been widely studied in the literature; examples include the location estimator and the maximum score estimator, as demonstrated in the next section. Define $N = \sum_j n_j$ and n = N/S. The main goal of this paper is to establish the convergence rate and asymptotic normality of θ̂0 under suitable conditions on S and the nj's.

Before introducing our main results, we first provide an intuitive explanation of why the divide and conquer method can improve efficiency in cubic-rate M-estimation problems. Assume for now that n1 = n2 = · · · = nS = n and that S is fixed. Following Kim and Pollard (1990), we can show that

$$n^{1/3}(\hat\theta^{(j)} - \theta_0) \xrightarrow{d} h_0, \qquad N^{1/3}(\tilde\theta_0 - \theta_0) \xrightarrow{d} h_0,$$

where θ̃0 is the pooled estimator, i.e., $\tilde\theta_0 = \arg\max_{\theta\in\Theta} \sum_{i,j} m(X_i^{(j)},\theta)$, θ0 is the unique maximizer of E{m(·, θ)}, and $h_0 = \arg\max_h Z(h)$ with

$$Z(h) = G(h) - \tfrac{1}{2}\, h^T V h. \tag{2}$$

Here G is a mean-zero Gaussian process and $V = -\partial^2 E\{m(\cdot,\theta)\}/\partial\theta\,\partial\theta^T\,|_{\theta=\theta_0}$ is a positive definite matrix.

Assume $\|N^{1/3}(\tilde\theta_0 - \theta_0)\|_2^2$ and $\|n^{1/3}(\hat\theta^{(j)} - \theta_0)\|_2^2$ are uniformly integrable. Then, we have

$$N^{2/3} E\{(\tilde\theta_0-\theta_0)(\tilde\theta_0-\theta_0)^T\} \to \mathrm{cov}(h_0), \quad n^{2/3} E\{(\hat\theta^{(j)}-\theta_0)(\hat\theta^{(j)}-\theta_0)^T\} \to \mathrm{cov}(h_0), \quad \text{as } N \to \infty. \tag{3}$$

Under equal allocation, the θ̂(j)'s are independent and identically distributed. We have

$$N^{2/3} E\{(\hat\theta_0-\theta_0)(\hat\theta_0-\theta_0)^T\} = \frac{N^{2/3}}{S^2}\sum_{j=1}^S E\{(\hat\theta^{(j)}-\theta_0)(\hat\theta^{(j)}-\theta_0)^T\} + \frac{N^{2/3}}{S^2}\sum_{j\neq k} E(\hat\theta^{(j)}-\theta_0)\, E(\hat\theta^{(k)}-\theta_0)^T$$
$$= \frac{n^{2/3}}{S^{1/3}}\, E\{(\hat\theta^{(1)}-\theta_0)(\hat\theta^{(1)}-\theta_0)^T\} + b_n b_n^T\, \frac{S^{2/3}(S-1)}{S} \;\to\; S^{-1/3}\,\mathrm{cov}(h_0), \tag{4}$$

where $b_n = n^{1/3} E(\hat\theta^{(j)} - \theta_0) = o(1)$ is the bias of $n^{1/3}\hat\theta^{(j)}$. Comparing (3) with (4), we can see that the aggregated estimator is more efficient than the pooled estimator in the fixed-S scenario.

Now let S grow with N. As long as S satisfies $S = O(1/\|b_n\|_2^2)$, we have

$$b_n b_n^T\, \frac{S^{2/3}(S-1)}{S} = O(S^{-1/3}),$$

and hence $N^{2/3} E\{(\hat\theta_0-\theta_0)(\hat\theta_0-\theta_0)^T\} = O(S^{-1/3})$. In view of (3), this implies that the aggregated estimator can achieve a faster convergence rate than the pooled estimator.

2.1 Main results

We assume the dimension d is fixed, while the number of groups S →∞ as N →∞. Let || · ||2 denote the Euclidean norm for vectors or induced matrix L2 norm for matrices. We first introduce some conditions.

  • (A1) There exists a small neighborhood Nδ = {θ : ‖θ − θ0‖2 ≤ δ} in which E{m(·, θ)} is twice continuously differentiable with Hessian matrix −V(θ), where V(θ) is positive definite in Nδ. Moreover, assume $E\{m(\cdot,\theta_0)\} > \sup_{\theta \in N_\delta^c} E\{m(\cdot,\theta)\}$.

  • (A2) For any θ1, θ2 ∈ Nδ, we have E{|m(·, θ1) − m(·, θ2)|²} ≤ K‖θ1 − θ2‖2 for a constant K that is independent of θ1 and θ2.

  • (A3) There exists some positive constant ω such that |m(x, θ)| ≤ ω for all x and θ.

  • (A4) The envelope function $M_R(\cdot) \equiv \sup_\theta\{|m(\cdot,\theta) - m(\cdot,\theta_0)| : \|\theta-\theta_0\|_2 \le R\}$ satisfies $E(M_R^2) = O(R)$ when R ≤ δ.

  • (A5) The set of functions {m(·, θ) : θ ∈ Θ} has Vapnik–Chervonenkis (VC) index 1 ≤ v < ∞.

  • (A6) For any θ ∈ Nδ, ‖V(θ) − V‖2 = O(‖θ − θ0‖2), where V = V(θ0).

  • (A7) Let L(·) denote the variance process of G(·), satisfying L(h) > 0 whenever h ≠ 0. (i) The function L(·) is symmetric and continuous, and has the rescaling property L(kh) = kL(h) for k > 0. (ii) For any h1, h2 ∈ ℝd satisfying ‖h1‖2 ≤ n^{1/3}δ and ‖h2‖2 ≤ n^{1/3}δ, we have

$$\left| L(h_1 - h_2) - n^{1/3} E\{m(\cdot,\theta_0 + n^{-1/3}h_1) - m(\cdot,\theta_0 + n^{-1/3}h_2)\}^2 \right| = O\left(\frac{(\|h_1\|_2 + \|h_2\|_2)^2}{n^{1/3}}\right).$$

  • (A8) Let cj = nj/n. Assume there exists some constant $\bar c > 1$ such that $1/\bar c \le c_j \le \bar c$ for all j.

Theorem 2.1

Under Conditions (A1)–(A8), if $S = o(n^{1/6}/\log^{5/6} n)$ and S → ∞ as n → ∞, we have

$$\sqrt{c_1^{2/3} + \cdots + c_S^{2/3}}\;\, n^{1/3}\, (\hat\theta_0 - \theta_0) \xrightarrow{d} N(0, A), \tag{5}$$

for some positive definite matrix A.

Remark 2.2

Under Condition (A8), Theorem 2.1 implies that θ̂0 converges at a rate of $O_p(S^{-1/2} n^{-1/3})$. In contrast, the original M-estimator based on the pooled data has a convergence rate of $O_p(S^{-1/3} n^{-1/3})$. This implies that we can gain efficiency by adopting the divide and conquer scheme for cubic-rate M-estimators. This result is interesting, as most aggregated estimators in the divide and conquer literature share the same convergence rates as the original estimators based on pooled data.

Remark 2.3

The constraint on S means that the number of groups cannot diverge too fast. The main reason, as shown in the proof of Theorem 2.1, is that if S grows too fast, the asymptotic normality of θ̂0 fails due to the accumulation of bias in the aggregation of subgroup estimators. Given a dataset of size N, we can take $S \asymp N^l$ and $n = N/S \asymp N^{1-l}$ with l < 1/7 to fulfill this requirement. It turns out that this requirement on S can be relaxed in some special cases. In particular, when d = 1, i.e., θ0 is a scalar, we show in the supplementary appendix that the aggregated estimator is asymptotically normal as long as $S \asymp N^l$ with l < 4/13. Details can be found in Section A.5 of the supplementary appendix.

Remark 2.4

Conditions (A1)–(A5) and (A7)(i) are similar to those in Kim and Pollard (1990) and are used to establish the cubic-rate convergence of the M-estimator in each group. Conditions (A6) and (A7)(ii) are used to establish the normality of the aggregated estimator. In particular, Condition (A7)(ii) implies that the Gaussian process G(·) has stationary increments, i.e., E[{G(h1) − G(h2)}²] = L(h1 − h2) for any h1, h2 ∈ ℝd, which is used to control the bias of the aggregated estimator. Condition (A8) automatically holds when n1 = · · · = nS.

In the rest of this section, we give a sketch of the proof of Theorem 2.1. The details of the proof are given in Section 5 and Section A in the supplementary appendix. Let $\hat h^{(j)} = n_j^{1/3}(\hat\theta^{(j)} - \theta_0)$. By definition, it is equivalent to show

$$\frac{1}{\sqrt{c_1^{2/3} + \cdots + c_S^{2/3}}}\, \sum_{j=1}^S c_j^{1/3}\, \hat h^{(j)} \xrightarrow{d} N(0, A). \tag{6}$$

When S diverges, intuitively, (6) follows from a direct application of the central limit theorem for triangular arrays (cf. Theorem 11.1.1, Athreya and Lahiri, 2006). However, a few challenges remain. First, the estimator ĥ(j) may not possess a finite second moment. Analogous to Kolmogorov's three-series theorem (cf. Theorem 8.3.5, Athreya and Lahiri, 2006), we handle this by first defining h̃(j), a truncated version of ĥ(j) with $\|\tilde h^{(j)}\|_2 \le \delta_{n_j}$ for some $\delta_{n_j} > 0$, such that $\sum_j \hat h^{(j)}$ and $\sum_j \tilde h^{(j)}$ are tail equivalent, i.e.,

$$\lim_{k\to\infty} \Pr\left(\bigcap_{n \ge k}\left\{\sum_{j=1}^{S(n)} c_j^{1/3}\, \hat h^{(j)} = \sum_{j=1}^{S(n)} c_j^{1/3}\, \tilde h^{(j)}\right\}\right) = 1.$$

Using the Borel–Cantelli lemma, it suffices to show

$$\sum_n \Pr\left(\sum_{j=1}^{S(n)} c_j^{1/3}\, \hat h^{(j)} \neq \sum_{j=1}^{S(n)} c_j^{1/3}\, \tilde h^{(j)}\right) < \infty. \tag{7}$$

Now it remains to show

$$\frac{1}{\sqrt{\sum_j c_j^{2/3}}}\, \sum_{j=1}^S c_j^{1/3}\, \tilde h^{(j)} = \frac{1}{\sqrt{\sum_j c_j^{2/3}}}\, \sum_{j=1}^S c_j^{1/3}\{\tilde h^{(j)} - E(\tilde h^{(j)})\} + \frac{1}{\sqrt{\sum_j c_j^{2/3}}}\, \sum_j c_j^{1/3}\, E(\tilde h^{(j)}) \xrightarrow{d} N(0, A).$$

The second challenge is to control the accumulated bias in the aggregated estimator, i.e. showing

$$\frac{1}{\sqrt{\sum_j c_j^{2/3}}}\, \sum_j c_j^{1/3}\, E(\tilde h^{(j)}) \to 0,$$

or

$$\sqrt{S}\, \sup_j \|E(\tilde h^{(j)})\|_2 \to 0, \tag{8}$$

by Assumption (A8). Finally, it remains to show that the second and third moments of h̃(j) satisfy

$$\sup_j \left| E(a^T \tilde h^{(j)})^2 - a^T A a \right| \to 0, \tag{9}$$
$$\sup_j E\|\tilde h^{(j)}\|_2^3 < \infty, \tag{10}$$

for any a ∈ ℝd. Once (7), (8), (9) and (10) are established, Theorem 2.1 follows from Lyapunov's central limit theorem (cf. Corollary 11.1.4, Athreya and Lahiri, 2006). Section 5 is devoted to verifying (7), (9) and (10), while Section A in the supplementary appendix is devoted to proving (8).

3 Applications

In this section, we illustrate our main theorem (Theorem 2.1) with three applications: the simple one-dimensional location estimator (Example 3.1) and two more complicated multi-dimensional estimators with constraints, namely the maximum score estimator (Example 3.2) and the value search estimator (Example 3.3).

3.1 Location estimator

Let $X_i^{(j)}$ (i = 1, …, n; j = 1, …, S) be i.i.d. random variables on the real line with a continuous density p. In each subgroup j, consider the location estimator

$$\hat\theta^{(j)} = \arg\max_\theta \frac{1}{n}\sum_{i=1}^n I(\theta - 1 \le X_i^{(j)} \le \theta + 1).$$

It was shown in Example 3.2.13 of van der Vaart and Wellner (1996) and Example 6.1 of Kim and Pollard (1990) that each θ̂(j) has cubic-rate convergence. We assume that Pr(X ∈ [θ − 1, θ + 1]) has a unique maximizer at θ0. When the derivative of p exists and is continuous, p′(θ0 − 1) − p′(θ0 + 1) > 0 implies that the second derivative of Pr(X ∈ [θ − 1, θ + 1]) is negative for all θ within some small neighborhood Nδ around θ0. Therefore, Condition (A2) holds, since

$$E|I(\theta_1 - 1 \le X \le \theta_1 + 1) - I(\theta_2 - 1 \le X \le \theta_2 + 1)|^2 = \Pr(\theta_1 - 1 \le X \le \theta_2 - 1) + \Pr(\theta_1 + 1 \le X \le \theta_2 + 1) \le \sup_{\theta \in N_\delta}\{p(\theta - 1) + p(\theta + 1)\}\,|\theta_1 - \theta_2|,$$

for θ1 ≤ θ2 and |θ1 − θ2| < 0.5. Moreover, if we further assume that p has a continuous second derivative in the neighborhood Nδ, Condition (A6) is satisfied.

The class of functions {I(θ − 1 ≤ X ≤ θ + 1) : θ ∈ Θ} is bounded by 1 and belongs to the VC class. In addition, we have

$$\sup_{|\theta - \theta_0| < \varepsilon} |I(\theta - 1 \le X \le \theta + 1) - I(\theta_0 - 1 \le X \le \theta_0 + 1)| \le I(\theta_0 - 1 - \varepsilon \le X \le \theta_0 - 1 + \varepsilon) + I(\theta_0 + 1 - \varepsilon \le X \le \theta_0 + 1 + \varepsilon),$$

for small ε. The squared L2(P) norm of the function on the right-hand side is O(ε). Hence, Conditions (A4) and (A5) hold.

Next, we claim that Condition (A7) holds with L(h) ≡ 2p(θ0 + 1)|h|, or equivalently {p(θ0 − 1) + p(θ0 + 1)}|h|, since p(θ0 − 1) = p(θ0 + 1) by the first-order condition at θ0. Obviously, L(·) is symmetric and satisfies the rescaling property. For any h1, h2 such that max(|h1|, |h2|) ≤ n^{1/3}δ, we define θ1 = θ0 + n^{−1/3}h1 ∈ Nδ and θ2 = θ0 + n^{−1/3}h2 ∈ Nδ. Let [a, b] denote the indicator function I(a ≤ X ≤ b). Assume h1 ≤ h2. We have

$$n^{1/3} E\big|[\theta_1 - 1, \theta_1 + 1] - [\theta_2 - 1, \theta_2 + 1]\big|^2 = n^{1/3} E[\theta_1 - 1, \theta_2 - 1] + n^{1/3} E[\theta_1 + 1, \theta_2 + 1] = n^{1/3}\int_{\theta_1 - 1}^{\theta_2 - 1} p(\theta)\, d\theta + n^{1/3}\int_{\theta_1 + 1}^{\theta_2 + 1} p(\theta)\, d\theta = \{p(\theta_0 + 1) + p(\theta_0 - 1)\}(h_2 - h_1) + R,$$

where the remainder term R is bounded by

$$\sup_{\theta_1 \le \theta \le \theta_2}\big(|p(\theta - 1) - p(\theta_0 - 1)| + |p(\theta + 1) - p(\theta_0 + 1)|\big)(h_2 - h_1) \le 4 n^{-1/3} \sup_{\theta \in N_\delta} |p'(\theta)|\, (h_2 - h_1)\max(|h_1|, |h_2|) \le 4 n^{-1/3} \sup_{\theta \in N_\delta} |p'(\theta)|\, (|h_1| + |h_2|)^2,$$

using a first-order Taylor expansion. The case h1 > h2 can be handled similarly. Therefore, Condition (A7) holds, and Theorem 2.1 then follows.
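
As a toy illustration of this example (our own sketch, with ad hoc sample sizes and grid resolution), each subgroup estimator can be computed by a grid search, since the objective is a step function of θ, and the subgroup estimates are then averaged:

```python
import numpy as np

def location_estimator(x, grid):
    """Maximize sum_i I(theta - 1 <= X_i <= theta + 1) over a grid of theta."""
    xs = np.sort(x)
    # number of observations falling in [t - 1, t + 1] for each grid point t
    counts = (np.searchsorted(xs, grid + 1, side="right")
              - np.searchsorted(xs, grid - 1, side="left"))
    return grid[np.argmax(counts)]

rng = np.random.default_rng(0)
S, n = 64, 4096                        # number of groups and group size
grid = np.linspace(-1.0, 1.0, 2001)    # search grid for theta
est = np.array([location_estimator(rng.standard_normal(n), grid)
                for _ in range(S)])
theta_hat0 = est.mean()                # equal sizes: simple average
print(theta_hat0)                      # close to theta_0 = 0
```

For the standard normal density, θ0 = 0 and the aggregated estimate concentrates around it at the faster $S^{-1/2} n^{-1/3}$ rate predicted by Theorem 2.1.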

3.2 Maximum score estimator

Consider the regression model $Y_i^{(j)} = X_i^{(j)T}\beta_0 + e_i^{(j)}$, i = 1, …, n, j = 1, …, S, where $X_i^{(j)}$ is a d-dimensional vector of covariates and $e_i^{(j)}$ is the random error. Assume that the $(X_i^{(j)}, e_i^{(j)})$'s are i.i.d. copies of (X, e). The maximum score estimator is defined as

$$\hat\beta^{(j)} = \arg\max_{\|\beta\|_2 = 1} \sum_{i=1}^n \left\{ I(Y_i^{(j)} \ge 0,\, X_i^{(j)T}\beta \ge 0) + I(Y_i^{(j)} < 0,\, X_i^{(j)T}\beta < 0) \right\},$$

where the constraint ||β||2 = 1 is to guarantee the uniqueness of the maximizer.

Assume ‖β0‖2 = 1; otherwise we can define β* = β0/‖β0‖2 and establish the asymptotic distribution of β̂0 − β* instead. It was shown in Example 6.4 of Kim and Pollard (1990) that β̂(j) has cubic-rate convergence when (i) median(e|X) = 0; (ii) X has a bounded, continuously differentiable density p; and (iii) the angular component of X has a bounded continuous density with respect to the surface measure on 𝒮^{d−1}, the unit sphere in ℝd.

Theorem 2.1 is not directly applicable to this example since Assumption (A1) is violated. The Hessian matrix

$$V = -\left.\frac{\partial^2 E\{I(Y_i^{(j)} \ge 0,\, X_i^{(j)T}\beta \ge 0) + I(Y_i^{(j)} < 0,\, X_i^{(j)T}\beta < 0)\}}{\partial\beta\, \partial\beta^T}\right|_{\beta_0}$$

is not positive definite. One possible solution is to use arguments from the constrained M-estimation literature (e.g., Geyer, 1994) to approximate the set ‖β‖2 = 1 by the hyperplane (β − β0)^Tβ0 = 0, and obtain a version of Theorem 2.1 for constrained cubic-rate M-estimators. We adopt an alternative approach here and consider a simple reparameterization that makes Theorem 2.1 applicable.

By Gram–Schmidt orthogonalization, we can obtain an orthogonal matrix [β0, U0], where $U_0 \in \mathbb{R}^{d\times(d-1)}$ satisfies $U_0^T\beta_0 = 0$. Define

$$\beta(\theta) = \sqrt{1 - \|\theta\|_2^2}\;\beta_0 + U_0\theta, \tag{11}$$

for all θ ∈ ℝ^{d−1} with ‖θ‖2 ≤ 1. Take Θ to be the unit ball $B_2^{d-1}$ in ℝ^{d−1}. Define

$$\hat\theta^{(j)} = \arg\max_{\theta\in\Theta} \sum_{i=1}^n \left[ I(Y_i^{(j)} \ge 0,\, X_i^{(j)T}\beta(\theta) \ge 0) + I(Y_i^{(j)} < 0,\, X_i^{(j)T}\beta(\theta) < 0) \right].$$

Note that under the assumption median(e|X) = 0, we have θ0 = 0.
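
The reparameterization is easy to carry out numerically. The following sketch (our own illustration; the orthonormal complement U0 is obtained via a QR decomposition rather than explicit Gram–Schmidt, and the optimizer is left unspecified) builds the map (11) and the subgroup objective:

```python
import numpy as np

def basis_orthogonal_to(beta0):
    """Return U0 of shape (d, d-1) with orthonormal columns and U0^T beta0 = 0."""
    d = beta0.size
    # QR of [beta0, e_1, ..., e_{d-1}]: the first column of Q is beta0 (up to
    # sign); the remaining columns are an orthonormal basis of its complement.
    q, _ = np.linalg.qr(np.column_stack([beta0, np.eye(d)[:, : d - 1]]))
    return q[:, 1:]

def beta_of_theta(theta, beta0, U0):
    """Map theta in the unit ball of R^{d-1} to the unit sphere, as in (11)."""
    return np.sqrt(1.0 - theta @ theta) * beta0 + U0 @ theta

def score_objective(theta, X, Y, beta0, U0):
    """Maximum score objective evaluated at beta(theta)."""
    xb = X @ beta_of_theta(theta, beta0, U0)
    return np.sum((Y >= 0) & (xb >= 0)) + np.sum((Y < 0) & (xb < 0))
```

Since the objective is piecewise constant in θ, a grid or direct search over the unit ball is a natural way to locate the argmax in each group.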

Let m(y, x, β) = I(y ≥ 0, xT β ≥ 0) + I(y < 0, xTβ < 0). Define

$$\kappa(x) = E\{I(e + X^T\beta_0 \ge 0) - I(e + X^T\beta_0 < 0) \mid X = x\}.$$

It is shown in Kim and Pollard (1990) that

$$\frac{\partial E\{m(\cdot,\cdot,\beta)\}}{\partial\beta} = \|\beta\|_2^{-2}\,\beta^T\beta_0\,\big(I - \|\beta\|_2^{-2}\beta\beta^T\big) \int_{x^T\beta_0 = 0} \kappa(T_\beta x)\, p(T_\beta x)\, x\, d\sigma, \tag{12}$$

where

$$T_\beta = (I - \|\beta\|_2^{-2}\beta\beta^T)(I - \beta_0\beta_0^T) + \|\beta\|_2^{-1}\beta\beta_0^T,$$

and σ is the surface measure on the hyperplane x^Tβ0 = 0.

Note that ∂β(θ)/∂θ has finite derivatives of all orders as long as ‖θ‖2 < 1. Assume that κ and p are twice continuously differentiable. This together with (12) implies that E{m(·, ·, β(θ))} has continuous third derivatives as a function of θ in a small neighborhood Nδ (δ < 1) around 0. This verifies (A6). Moreover, for any θ1, θ2 ∈ Nδ with ‖θ1 − θ2‖2 ≤ ε, we have

$$\|\beta(\theta_1) - \beta(\theta_2)\|_2^2 = \|\theta_1 - \theta_2\|_2^2 + \left(\sqrt{1 - \|\theta_1\|_2^2} - \sqrt{1 - \|\theta_2\|_2^2}\right)^2 = \|\theta_1 - \theta_2\|_2^2 + \frac{(\|\theta_2\|_2^2 - \|\theta_1\|_2^2)^2}{\left(\sqrt{1 - \|\theta_1\|_2^2} + \sqrt{1 - \|\theta_2\|_2^2}\right)^2} \le \frac{2\|\theta_1 - \theta_2\|_2^2}{1 - \delta^2}. \tag{13}$$

Kim and Pollard (1990) showed that E{|m(·, ·, β1) −m(·, ·, β2)|} = O(||β1β2||2) near β0. This together with (13) implies

$$E\{|m(\cdot,\cdot,\beta(\theta_1)) - m(\cdot,\cdot,\beta(\theta_2))|^2\} \le 2\, E\{|m(\cdot,\cdot,\beta(\theta_1)) - m(\cdot,\cdot,\beta(\theta_2))|\} = O(\|\theta_1 - \theta_2\|_2).$$

Therefore, (A2) is satisfied and (A3) trivially holds since |m| ≤ 1.

It was also shown in Kim and Pollard (1990) that the envelope Mε of the class of functions {m(·, ·, β) − m(·, ·, β0) : ‖β − β0‖2 ≤ ε} satisfies $E(M_\varepsilon^2) = O(\varepsilon)$. Using (13), we can show that the envelope M̃ε of the class of functions {m(·, ·, β(θ)) − m(·, ·, β0) : ‖θ‖2 ≤ ε} also satisfies $E(\tilde M_\varepsilon^2) = O(\varepsilon)$. Thus, (A4) is satisfied. Moreover, since the class of functions m(·, ·, β) over all β belongs to the VC class, so does the class of functions m(·, ·, β(θ)). This verifies (A5).

Finally, we establish (A7). For any θ1, θ2 ∈ Nδ, define $h_1 = n^{1/3}\theta_1$ and $h_2 = n^{1/3}\theta_2$. We have

$$n^{1/3} E\{|m(Y, X, \beta(h_1/n^{1/3})) - m(Y, X, \beta(h_2/n^{1/3}))|^2\} = n^{1/3} E\{|I(X^T\beta(h_1/n^{1/3}) \ge 0) - I(X^T\beta(h_2/n^{1/3}) \ge 0)|\, I(Y \ge 0)\} + n^{1/3} E\{|I(X^T\beta(h_1/n^{1/3}) < 0) - I(X^T\beta(h_2/n^{1/3}) < 0)|\, I(Y < 0)\} = n^{1/3} E\{|I(X^T\beta(h_1/n^{1/3}) \ge 0) - I(X^T\beta(h_2/n^{1/3}) \ge 0)|\}. \tag{14}$$

We write X as rβ0 + z with z orthogonal to β0. Equation (14) can then be written as

$$n^{1/3} E\left\{\left| I\left(r\sqrt{1 - \left\|\frac{h_1}{n^{1/3}}\right\|_2^2} + z^T U\frac{h_1}{n^{1/3}} \ge 0\right) - I\left(r\sqrt{1 - \left\|\frac{h_2}{n^{1/3}}\right\|_2^2} + z^T U\frac{h_2}{n^{1/3}} \ge 0\right) \right|\right\}. \tag{15}$$

Define $\omega = n^{1/3} r$. Equation (15) can be expressed as

$$\int\!\!\int I\left(-z^T U h_1 (1 - n^{-2/3}\|h_1\|_2^2)^{-1/2} > \omega \ge -z^T U h_2 (1 - n^{-2/3}\|h_2\|_2^2)^{-1/2}\right) p(n^{-1/3}\omega, z)\, d\omega\, dz.$$

Assume that p(r, z) is differentiable with respect to r and |∂p(r, z)/∂r| ≤ q(z) for some function q. Then, (15) is equal to

$$\int \left| z^T U\left\{h_1 (1 - n^{-2/3}\|h_1\|_2^2)^{-1/2} - h_2 (1 - n^{-2/3}\|h_2\|_2^2)^{-1/2}\right\}\right| p(0, z)\, dz + R_1 = \int |z^T U (h_1 - h_2)|\, p(0, z)\, dz + R_1 + R_2,$$

where the remainders |R1| and |R2| are bounded by

$$|R_1| \le n^{-1/3}\int \left\{(z^T U h_1)^2 + (z^T U h_2)^2\right\} q(z)\, dz = O\left(n^{-1/3}\{\|h_1\|_2^2 + \|h_2\|_2^2\}\right),$$

and

$$|R_2| \le \left|(1 - n^{-2/3}\|h_1\|_2^2)^{-1/2} - 1\right| \int |z^T U h_1|\, p(0, z)\, dz + \left|(1 - n^{-2/3}\|h_2\|_2^2)^{-1/2} - 1\right| \int |z^T U h_2|\, p(0, z)\, dz \le n^{-1/3}(\|h_1\|_2 + \|h_2\|_2)\int (|z^T U h_1| + |z^T U h_2|)\, p(0, z)\, dz = O\left(n^{-1/3}\{\|h_1\|_2^2 + \|h_2\|_2^2\}\right),$$

under suitable moment assumptions on functions p(0, z) and q(z). This verifies (A7).

An application of Theorem 2.1 implies

$$\frac{1}{\sqrt{S}}\sum_{j=1}^S n^{1/3}\,\hat\theta^{(j)} \xrightarrow{d} N(0, A),$$

for some positive definite matrix $A \in \mathbb{R}^{(d-1)\times(d-1)}$. Hence

$$\frac{1}{\sqrt{S}}\sum_{j=1}^S n^{1/3}\, U\hat\theta^{(j)} \xrightarrow{d} N(0, UAU^T). \tag{16}$$

By the definition of θ̂(j) and β̂(j), we have

$$\left\|\frac{1}{\sqrt{S}}\sum_{j=1}^S n^{1/3}\left(\hat\beta^{(j)} - \beta_0 - U\hat\theta^{(j)}\right)\right\|_2 = \left|\frac{1}{\sqrt{S}}\sum_{j=1}^S n^{1/3}\left(\sqrt{1 - \|\hat\theta^{(j)}\|_2^2} - 1\right)\right| = \left|\frac{1}{\sqrt{S}}\sum_{j=1}^S n^{1/3}\,\frac{1 - \|\hat\theta^{(j)}\|_2^2 - 1}{\sqrt{1 - \|\hat\theta^{(j)}\|_2^2} + 1}\right| \le \frac{n^{1/3}}{\sqrt{S}}\sum_j \|\hat\theta^{(j)}\|_2^2.$$

With probability at least 1 − S/n → 1, the last expression is $O(\sqrt{S}\, n^{1/3}\cdot n^{-2/3}\log^{2/3} n) = o(1)$, which is implied by the tail inequality for θ̂(j) established in Theorem 5.1. Combining this with (16), we have

$$\frac{1}{\sqrt{S}}\sum_{j=1}^S n^{1/3}(\hat\beta^{(j)} - \beta_0) \xrightarrow{d} N(0, UAU^T).$$

3.3 Value search estimator

The value search estimator was introduced by Zhang et al. (2012) for estimating the optimal treatment regime. The data can be summarized as i.i.d. triples $\{O_i^{(j)} = (X_i^{(j)}, A_i^{(j)}, Y_i^{(j)}) : i = 1,\dots,n;\ j = 1,\dots,S\}$, where $X_i^{(j)} \in \mathbb{R}^d$ denotes the patient's baseline covariates, $A_i^{(j)}$ is the treatment received by the patient, taking the value 0 or 1, and $Y_i^{(j)}$ is the response, the larger the better by convention. Consider the following model

$$Y_i^{(j)} = \mu(X_i^{(j)}) + A_i^{(j)}\, C(X_i^{(j)}) + e_i^{(j)}, \tag{17}$$

where μ(·) is the baseline mean function, C(·) is the contrast function, and $e_i^{(j)}$ is the random error with $E\{e_i^{(j)} \mid A_i^{(j)}, X_i^{(j)}\} = 0$. The optimal treatment regime is defined in the potential outcome framework. Specifically, let $Y_i^{(j)}(0)$ and $Y_i^{(j)}(1)$ be the potential outcomes that would be observed if the patient received treatment 0 or 1, respectively. For a treatment regime d that maps $X_i^{(j)}$ to {0, 1}, define the potential outcome

$$Y_i^{(j)}(d) = d(X_i^{(j)})\, Y_i^{(j)}(1) + \{1 - d(X_i^{(j)})\}\, Y_i^{(j)}(0).$$

The optimal regime dopt is defined as the rule that maximizes the expected potential outcome, i.e., the value function $E\{Y_i^{(j)}(d)\}$. Under the stable unit treatment value assumption (SUTVA) and the no unmeasured confounders assumption (Splawa-Neyman, 1990), the optimal treatment regime under model (17) is given by dopt(x) = I{C(x) > 0}.

The true contrast function C(·) can be complex. As suggested by Zhang et al. (2012), in practice we can find the restricted optimal regime within a class of decision rules, such as the linear treatment decision rules d(x, β) = I(β1 + x1β2 + · · · + xdβd+1 > 0) indexed by β ∈ ℝ^{d+1}, where the subscript k denotes the kth element of the vector. Let β* = argmax_β V(β), where $V(\beta) = E[Y_i^{(j)}\{d(X_i^{(j)}, \beta)\}]$. To make β* identifiable, we assume $\beta_1^* = -1$. Define $\theta^* = (\beta_2^*, \dots, \beta_{d+1}^*)^T$. The restricted optimal treatment regime is then given by d̃(x, θ*) = I(x^Tθ* > 1), and the value function is $V(\theta) = E[Y_i^{(j)}\{\tilde d(X_i^{(j)}, \theta)\}]$ with d̃(x, θ) = I(x^Tθ > 1). Zhang et al. (2012) proposed an inverse propensity score weighted estimator of the value function V(θ) and the associated value search estimator obtained by maximizing the estimated value function. Specifically, for each group j, the value search estimator is defined as

$$\hat\theta^{(j)} = \arg\max_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n \frac{\tilde d(X_i^{(j)}, \theta)\, A_i^{(j)} + \{1 - \tilde d(X_i^{(j)}, \theta)\}(1 - A_i^{(j)})}{\pi_i^{(j)} A_i^{(j)} + (1 - \pi_i^{(j)})(1 - A_i^{(j)})}\; Y_i^{(j)}, \tag{18}$$

where $\pi_i^{(j)} = \Pr(A_i^{(j)} = 1 \mid X_i^{(j)})$ is the propensity score, which is known in a randomized study. Here, for illustration purposes, we assume that the $\pi_i^{(j)}$'s are known.
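
The criterion in (18) is straightforward to evaluate. Here is a minimal sketch (our own code; the optimization over θ is left to a user-supplied grid or derivative-free search, since the objective is piecewise constant in θ):

```python
import numpy as np

def ipw_value(theta, X, A, Y, pi):
    """Inverse propensity score weighted value estimate of the linear rule
    d(x, theta) = I(x^T theta > 1), as in (18)."""
    d = (X @ theta > 1).astype(float)
    followed = d * A + (1.0 - d) * (1.0 - A)   # I(received arm matches rule)
    prob = pi * A + (1.0 - pi) * (1.0 - A)     # probability of the received arm
    return np.mean(followed / prob * Y)

# Example for a scalar theta: maximize over a grid.
# grid = np.linspace(0.0, 5.0, 1001)
# theta_hat = grid[np.argmax([ipw_value(np.array([t]), X, A, Y, pi)
#                             for t in grid])]
```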

Define $m(O_i^{(j)}, \theta) = \xi_i^{(j)}\, \tilde d(X_i^{(j)}, \theta)$, where

$$\xi_i^{(j)} = \frac{A_i^{(j)}}{\pi_i^{(j)}}\, C(X_i^{(j)}) + \frac{A_i^{(j)} - \pi_i^{(j)}}{\pi_i^{(j)}(1 - \pi_i^{(j)})}\,\{\mu(X_i^{(j)}) + e_i^{(j)}\} = \left(\frac{A_i^{(j)}}{\pi_i^{(j)}} - \frac{1 - A_i^{(j)}}{1 - \pi_i^{(j)}}\right) Y_i^{(j)}.$$

With some algebra, we can show that θ̂(j) also maximizes $\mathbb{P}_n^{(j)} m(\cdot, \theta)$, where $\mathbb{P}_n^{(j)}$ is the empirical measure for the data in group j. Unlike the previous two examples, here the function m is not bounded. To fulfill (A3), we need $\|\xi_i^{(j)}\|_{\psi_1} < \infty$. This holds when $0 < \gamma_1 < \pi_i^{(j)} < \gamma_2 < 1$ for some constants γ1 and γ2, $\|C(X_i^{(j)})\|_{\psi_1} < \infty$, $\|\mu(X_i^{(j)})\|_{\psi_1} < \infty$ and $\|e_i^{(j)}\|_{\psi_1} < \infty$.

To show (A1) and (A6), we evaluate the integral

$$\Gamma(\theta) = E\{\xi\, \tilde d(X, \theta)\} = E\{C(X)\, \tilde d(X, \theta)\} = \int_{x^T\theta > 1} C(x)\, p(x)\, dx, \tag{19}$$

where p(x) is the density function of $X_i^{(j)}$. Consider the transformation

$$T_\theta = (I - \|\theta\|_2^{-2}\theta\theta^T) + \|\theta\|_2^{-2}\theta(\theta^*)^T,$$

which maps the region {x^Tθ* > 1} onto {x^Tθ > 1}, and {x^Tθ* = 1} onto {x^Tθ = 1}. We exclude the trivial case θ* = 0. The above definition is meaningful when θ is taken over a small neighborhood Nδ of θ*. We assume that the functions p and C are continuously differentiable. Note that

$$\frac{\partial T_\theta x}{\partial\theta} = -\frac{\theta^T x - (\theta^*)^T x}{\|\theta\|_2^2}\, I - \frac{\theta x^T}{\|\theta\|_2^2} + \frac{2\theta\theta^T\{x^T\theta - x^T\theta^*\}}{\|\theta\|_2^4}.$$

Using differential geometry arguments similar to those in Section 5 of Kim and Pollard (1990), a change of variables x ↦ T_θx shows that the integral (19) can be represented as

$$\Gamma(\theta) = \int_{x^T\theta^* > 1} C(T_\theta x)\, p(T_\theta x)\, \frac{\theta^T\theta^*}{\|\theta\|_2^2}\, dx,$$

which is thrice differentiable under certain conditions on C(x), p(x) and their derivatives.

To show (A7), we assume that the conditional density p(x|y) of X given Y = 1 − X^Tθ* exists and is continuously differentiable with respect to y. Similarly, assume that the density q(y) of Y exists and is continuously differentiable. Let g(X) = E(ξ²|X). For any h1, h2 ∈ ℝd, we have

$$n^{1/3} E\left\{\xi^2 \left| I(X^T\theta^* + n^{-1/3}X^T h_1 > 1) - I(X^T\theta^* + n^{-1/3}X^T h_2 > 1)\right|^2\right\} = n^{1/3}\int g(x) \left| I(n^{-1/3}x^T h_1 > y) - I(n^{-1/3}x^T h_2 > y)\right| p(x \mid y)\, q(y)\, dx\, dy.$$

Let $y = n^{-1/3} z$. The last expression in the above equation can be written as

$$\int g(x) \left| I(x^T h_1 > z) - I(x^T h_2 > z)\right| p(x \mid 0)\, q(0)\, dx\, dz + R = \int g(x) \left| x^T(h_1 - h_2)\right| p(x \mid 0)\, q(0)\, dx + R,$$

with the remainder term

$$R = \int g(x) \left| I(x^T h_1 > z) - I(x^T h_2 > z)\right| \left\{p(x \mid n^{-1/3} z)\, q(n^{-1/3} z) - p(x \mid 0)\, q(0)\right\} dx\, dz,$$

which is $O(n^{-1/3}(\|h_1\|_2^2 + \|h_2\|_2^2))$ under certain conditions on q(·) and p(x|·). Conditions (A2) and (A4) can be verified similarly. Since the class of functions {g(x)I(x^Tθ > 1) : θ ∈ ℝd} has finite VC index, Condition (A5) also holds. Theorem 2.1 then follows.

4 Numerical studies

In this section, we examine the numerical performance of the aggregated M-estimator for the three examples studied in the previous section and compare it with the M-estimator based on pooled data, denoted as the pooled estimator.

4.1 Location estimator

The data Xj (j = 1, …, N) were independently generated from the standard normal distribution. The true parameter θ0 that maximizes E{I(θ − 1 ≤ Xjθ + 1)} was set to be 0. Let θ̃0 and θ̂0 denote the pooled estimator and the aggregated estimator, respectively. To obtain θ̂0, we randomly divided the data into S blocks with equal size n = N/S.

We took $N = 2^i$ for i = 14, 16, 18, 20, and chose $S = 2^j$ such that 0.2 ≤ j/i < 0.625 when $N = 2^i$. For each combination of N and S, we estimated the standard error of θ̂0 by

$$\widehat{\mathrm{SE}}(\hat\theta_0) = \frac{1}{\sqrt{S}}\left\{\frac{1}{S - 1}\sum_{l=1}^S (\hat\theta^{(l)} - \hat\theta_0)^2\right\}^{1/2},$$

where θ̂(l) denotes the M-estimator for the lth group. For each scenario, we conducted 1000 simulation replications and plot the coverage probabilities of 95% predictive intervals in Figure 1. We also report the bias and sample standard deviation (denoted as SD) of the estimators θ̃0 and θ̂0, the mean of the estimated standard errors, and the coverage probability (denoted as CP) of the Wald-type 95% confidence interval for θ̂0 in Table 1 of the supplementary appendix, for some of the scenarios where $N = 2^i$ for i = 14, 16, 18, 20, and $S = 2^j$ for j = 4, 5, 6, 7. Unlike θ̂0, θ̃0 does not converge to a tractable limiting distribution and does not have a convenient variance estimator. Hence, in Table 1, we did not provide standard errors or confidence intervals for θ̃0.
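
A sketch of this variance estimate and the resulting Wald-type interval (our own code, written for a scalar parameter and equal group sizes):

```python
import numpy as np

def wald_ci(subgroup_estimates, z=1.96):
    """Aggregate scalar subgroup M-estimators (equal group sizes) and form
    a Wald-type confidence interval from the across-group sample variance."""
    est = np.asarray(subgroup_estimates, dtype=float)
    S = est.size
    theta_hat0 = est.mean()                 # simple average over groups
    se = est.std(ddof=1) / np.sqrt(S)       # SE-hat in the display above
    return theta_hat0, (theta_hat0 - z * se, theta_hat0 + z * se)
```

This sidesteps the nonstandard limiting distribution of a single cubic-rate estimator: the across-group variability supplies the variance estimate directly.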

Figure 1. Coverage probability of the 95% predictive interval with different choices of N and S, for the location estimator.

From Figure 1, it is clear that for this specific application, the coverage probabilities are approximately 95% when S ≤ S*, where $S^* \approx N^{0.55}$. In this example, the cubic-rate estimator is one-dimensional, and according to Theorem A.1 and the discussion in Section A.5.1 in the supplementary appendix, the aggregated estimator is asymptotically normal when $S = O(N^l)$ for 0 < l < 4/13. This is consistent with our numerical findings. Based on the results in Table 1, it can be seen that the aggregated estimator θ̂0 has a much smaller standard deviation than the pooled estimator θ̃0, indicating the efficiency gain from the divide and conquer scheme shown in our theory. In addition, the bias of θ̂0 generally increases and the standard deviation of θ̂0 generally decreases as S and N increase, and the normal approximation becomes more accurate as S increases. This demonstrates the bias-variance trade-off for aggregated estimators. With properly chosen S, the estimated standard error of θ̂0 is close to its standard deviation and the coverage probability is close to the nominal level.

4.2 Maximum score estimator

Consider the model $Y_i = 1.5X_{i1} - 1.5X_{i2} + 0.5e_i$, i = 1, …, N, where $X_{i1}$, $X_{i2}$ and $e_i$ were generated independently from the standard normal distribution. Hence, θ0 = (θ1, θ2)^T = (1.5, −1.5)^T. Let θ̃0 = (θ̃1, θ̃2)^T denote the pooled estimator and θ̂0 = (θ̂1, θ̂2)^T the aggregated estimator. We set $N = 2^{20}, 2^{22}$ and $S = 2^j$ such that 0.18 ≤ j/i ≤ 0.42 when $N = 2^i$. The coverage probabilities of the 95% confidence intervals for θ̂1 and θ̂2 are plotted in Figure 2 based on 1000 replications. They are close to the nominal level when S ≤ S*, where $S^* \approx N^{0.32}$. This example can also be regarded as a one-dimensional cubic-rate estimation problem, since θ̂1 and θ̂2 satisfy the constraint $\hat\theta_1^2 + \hat\theta_2^2 = 1$. Therefore, similar to the discussion in Section A.5.2, we can show that θ̂1 and θ̂2 are asymptotically normal when $S = O(N^l)$ for 0 < l < 4/13. This upper bound is close to S*, since 4/13 ≈ 0.308. Other results are given in Table 2 of the supplementary appendix. The findings are very similar to those for the location estimator in the previous example.

Figure 2. Coverage probability of the 95% predictive interval with different choices of N and S, for the maximum score estimator.

4.3 Value search estimator

Consider the model $Y_i = 1 + A_i(2X_i - 1) + e_i$, i = 1, …, N, where $X_i \sim N(0, 1)$, $e_i \sim N(0, 0.25)$, and $\Pr(A_i = 1) = 0.5$. Under this model assumption, the optimal treatment rule takes the form

dopt(x)=I(2x>1),

and hence θ* = 2.

We take $N = 2^{24}, 2^{25}, 2^{26}$ and $2^{27}$. When $N = 2^{24}$ and $2^{25}$, we choose $S = 2^j$ for j = 4, 5, 6, 7. When $N = 2^{26}$ and $2^{27}$, we choose $S = 2^j$ for j = 5, 6, 7, 8. This gives a total of 16 scenarios. We plot the coverage probabilities of the 95% predictive intervals for θ̂0 in Figure 3 for these combinations of S and N. When $S \le S^* \approx N^{0.27}$, the coverage probabilities are close to 95%. This is also a one-dimensional problem. Note that in this application, the rate 0.27 in the practical upper bound is slightly smaller than the theoretical upper bound 4/13 ≈ 0.308. However, the theoretical upper bound is only determined up to a scaling constant. When N becomes larger, the ratio log S*/log N should be close to or larger than 0.308. Details about the bias and the sample standard deviations of the aggregated estimator are given in Table 3 of the supplementary appendix.

Figure 3. Coverage probability of the 95% predictive interval with different choices of N and S, for the value search estimator.

4.4 Yahoo! Today Module user click log dataset

Online content recommendation services have received extensive attention in both the machine learning and statistics literature. These online services strive to recommend advertisements or news articles to individual users by making use of both content and user information. In this subsection, we apply the proposed method to a Yahoo! Today Module user click log dataset, which contains 45,811,883 user visits to the Today Module during the first ten days of May 2009. Given such a large number of observations, it is extremely difficult to analyze the entire dataset on a single computer, which makes the divide and conquer method a natural tool for handling it.

For the ith visit, the dataset contains a binary response variable Yi, the ID of the recommended article and a 6-dimensional feature vector of the user. Due to sensitivity and privacy concerns, feature definitions and article names were not included in the data. Here, Yi = 1 means the user clicked the recommended article and Yi = 0 means the user did not click. The last element in the feature vector is always 1, and the first five sum to 1. Therefore, we took the first three and the fifth elements of the feature vector to form the covariates Xi. For illustration, we only consider a subset of the data that contains visits on May 1st where the recommended article ID is either 109510 or 109520. There were a total of 50 candidate articles on May 1st; we chose these two articles since they were recommended most often on that day. This gives us a total of 405,888 visits. On the reduced dataset, define Ai = 1 if the recommended article is 109510 and Ai = 0 otherwise. In this example, the online recommendation problem can be formulated as follows. Denote by 𝒟 a given set of functions that map the covariate space to the space of article IDs. Our aim is to find the optimal recommendation strategy in 𝒟 to maximize the users' click-through rate. We consider estimating the optimal recommendation rule among the set of linear decision functions 𝒟 = {I(x^Tθ > 1) : θ ∈ ℝ⁴}. Hence, estimating the optimal recommendation strategy is similar to the problem of estimating the optimal treatment regime described in Section 3.3. Specifically, we divide the data randomly into S pieces, $\{(X_i^{(j)}, A_i^{(j)}, Y_i^{(j)}) : i = 1,\dots,n_j\}_{j=1,\dots,S}$, and obtain

$$\hat\theta^{(j)} = \arg\max_{\theta\in\mathbb{R}^4} \frac{1}{n_j}\sum_{i=1}^{n_j}\left\{\left(\frac{A_i^{(j)}}{\hat\pi_i^{(j)}}\, I(\theta^T X_i^{(j)} > 1) + \frac{1 - A_i^{(j)}}{1 - \hat\pi_i^{(j)}}\, I(\theta^T X_i^{(j)} \le 1)\right) Y_i^{(j)} - \left(\frac{A_i^{(j)}}{\hat\pi_i^{(j)}}\, I(\theta^T X_i^{(j)} > 1) + \frac{1 - A_i^{(j)}}{1 - \hat\pi_i^{(j)}}\, I(\theta^T X_i^{(j)} \le 1) - 1\right)\left\{\hat h_{0i}^{(j)}\, I(\theta^T X_i^{(j)} \le 1) + \hat h_{1i}^{(j)}\, I(\theta^T X_i^{(j)} > 1)\right\}\right\}, \tag{20}$$

as the subgroup estimator, where $\hat\pi_i^{(j)}$, $\hat h_{0i}^{(j)}$ and $\hat h_{1i}^{(j)}$ are estimators of $\Pr(A_i^{(j)} = 1 \mid X_i^{(j)})$, $\Pr(Y_i^{(j)} = 1 \mid A_i^{(j)} = 0, X_i^{(j)})$ and $\Pr(Y_i^{(j)} = 1 \mid A_i^{(j)} = 1, X_i^{(j)})$, respectively, obtained by logistic regressions. We chose the nj's such that $\max_j n_j - \min_j n_j \le 1$. The estimated optimal recommendation strategy is given by $I(x^T\hat\theta_0 > 1)$, where $\hat\theta_0 = \sum_j \hat\theta^{(j)}/S$.

Remark 4.1

Compared to the value search estimator defined in (18), here we obtain the subgroup estimator by maximizing an augmented version of the inverse propensity score weighted estimator. The resulting estimator also converges at the $n^{-1/3}$ rate but is more efficient than the original one in (18).
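
A minimal sketch of the augmented criterion (our own code; the sign convention of the augmentation term follows the standard augmented IPW construction, and the fitted nuisance quantities pi_hat, h0_hat and h1_hat are assumed to be supplied, e.g. from logistic regressions):

```python
import numpy as np

def aipw_value(theta, X, A, Y, pi_hat, h0_hat, h1_hat):
    """Augmented IPW value estimate of the rule I(x^T theta > 1), cf. (20).

    pi_hat: estimated propensities Pr(A = 1 | X)
    h0_hat: estimated Pr(Y = 1 | A = 0, X);  h1_hat: same with A = 1
    """
    d = (X @ theta > 1).astype(float)
    w = d * A / pi_hat + (1.0 - d) * (1.0 - A) / (1.0 - pi_hat)
    m = d * h1_hat + (1.0 - d) * h0_hat     # outcome model under the rule
    return np.mean(w * Y - (w - 1.0) * m)   # augmentation term reduces variance
```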

Due to a data confidentiality agreement, we are not able to use the raw data. Here, we generate pseudo responses $\tilde Y_i^{(j)}$ given $X_i^{(j)}$ and $A_i^{(j)}$ from the Yahoo data, and use the dataset $\{(X_i^{(j)}, A_i^{(j)}, \tilde Y_i^{(j)}) : i = 1,\dots,n_j,\ j = 1,\dots,S\}$ in our application. The generated variables $\tilde Y_i^{(j)}$ are similar to the original responses $Y_i^{(j)}$. For example, we have $\sum_{i,j}\tilde Y_i^{(j)}/\sum_j n_j \approx 4.71\%$ while $\sum_{i,j} Y_i^{(j)}/\sum_j n_j \approx 4.73\%$. Moreover, under our data generating process, the population limit of θ̂(j) in (20) can be explicitly calculated as θ0 = (θ0,1, θ0,2, θ0,3, θ0,4)^T = (2.534, 2.881, 2.796, 3.200)^T for any j. Hence, θ0 is also the population limit of θ̂0 when S does not diverge too fast. Detailed descriptions of the generation of the $\tilde Y_i^{(j)}$'s are given in Section I of the supplementary appendix.

We choose $S = 2^j$ for j = 4, 5, …, 10. For a given S, denote by $\hat\theta_0(S) = (\hat\theta_{0,1}(S), \hat\theta_{0,2}(S), \hat\theta_{0,3}(S), \hat\theta_{0,4}(S))^T$ the corresponding aggregated estimator. For each S, we use the sample variance to estimate the variance of the aggregated estimator. Based on these estimates, we plot the estimates $\hat\theta_{0,i}(S)$ and the Wald-type 95% confidence intervals for θ0,i in Figure 4, for i = 1, …, 4 and different choices of S.

Figure 4. 95% confidence intervals of θ0,1, θ0,2, θ0,3 and θ0,4 (from top to bottom and from left to right), against log(S)/log(2). Dashed lines mark the corresponding θ0,i's.

It is clear from Figure 4 that the variance of $\hat\theta_0(S)$ decreases as S increases, since the width of the confidence intervals decreases with S. Moreover, when S is extremely large, some of the parameters are not covered by the 95% confidence intervals. For example, from the top left plot in Figure 4, θ0,1 is not covered by the confidence intervals of $\hat\theta_{0,1}(S)$ when $S = 2^9$ and $2^{10}$. This phenomenon is due to the large bias of $\hat\theta_0(S)$. These empirical results demonstrate the bias-variance trade-off for the aggregated estimator and are consistent with our theoretical findings.

5 Tail inequality for ĥ(j)

In this section, we establish tail inequalities for θ̂(j) and ĥ(j), which are used to construct h̃(j), a truncated version of ĥ(j) that is tail equivalent to it.

Theorem 5.1

Under Conditions (A1)-(A5), for sufficiently large nj, there exists some constant C0, such that

$$\Pr(\hat\theta^{(j)} \notin N_\delta) \le 2\exp(-C_0 n_j). \tag{21}$$

Moreover, for sufficiently large nj, there exist some constants C1, C2 > 0 and N0 ≥ 2, such that

$$\Pr\left(\|\hat h^{(j)}\|_2 > x \,\big|\, \hat\theta^{(j)} \in N_\delta\right) \le C_2 \exp(-C_1 x^3), \tag{22}$$

for any $N_0 \le x \le n_j^{1/3}\delta$.

Remark 5.1

Inequalities (21) and (22) can be viewed as generalizations of the consistency and rate-of-convergence results established for cube root estimators (cf. Corollary 4.2 in Kim and Pollard, 1990). The tail probability of ‖ĥ(j)‖2 is obtained based on the subexponential tail assumption (A3) on m(·, θ).

We represent ĥ(j) as

$$\hat h^{(j)} = \arg\max_{h \in H_{n_j}} M_{n_j,j}(h) \equiv \arg\max_{h \in H_{n_j}} \left\{n_j^{1/6}\, \mathbb{G}_{n_j}^{(j)}(m_h^{(j)}) + n_j^{2/3}\, E(m_h^{(j)})\right\},$$

where $H_{n_j} = \{h \in \mathbb{R}^d : n_j^{-1/3} h + \theta_0 \in \Theta\}$, $\mathbb{G}_{n_j}^{(j)} = n_j^{1/2}(\mathbb{P}_{n_j}^{(j)} - E)$ and $m_h^{(j)}(\cdot) = m(\cdot, \theta_0 + n_j^{-1/3} h) - m(\cdot, \theta_0)$. Similarly, define

$$\tilde h^{(j)} = \arg\max_{h \in H_{n_j} \cap H_{\delta_n}} M_{n_j,j}(h) = \arg\max_{h \in H_{n_j} \cap H_{\delta_n}} \left\{n_j^{1/6}\, \mathbb{G}_{n_j}^{(j)}(m_h^{(j)}) + n_j^{2/3}\, E(m_h^{(j)})\right\},$$

where $H_{\delta_n} = \{h : \|h\|_2 \le \delta_n\}$. By definition, $\|\tilde h^{(j)}\|_2 \le \delta_n$. The following corollaries are immediate applications of Theorem 5.1.

Corollary 5.1

Assume $\delta_n \le n_j^{1/3}\delta$. Under Conditions (A1)–(A5), for sufficiently large nj, there exist some constants N0 ≥ 2, C4 and C5, such that

$$\Pr(\|\tilde h^{(j)}\|_2 > x) \le C_5 \exp(-C_4 x^3), \qquad \forall x \ge N_0. \tag{23}$$

The proof is straightforward by noting that, for any $x \le n_j^{1/3}\delta$,

$$\Pr(\|\tilde h^{(j)}\|_2 > x) \le \Pr(\|\tilde h^{(j)}\|_2 > x \mid \hat\theta^{(j)} \in N_\delta)\Pr(\hat\theta^{(j)} \in N_\delta) + \Pr(\hat\theta^{(j)} \notin N_\delta) \le C_2\exp(-C_1 x^3) + 2\exp(-C_0 n_j) \le C_5\exp(-C_4 x^3).$$

Remark 5.2

Corollary 5.1 suggests that h̃(j) has finite moments of all orders. For any a ∈ ℝd and positive integer k, this implies that the sequence of random variables $|a^T\tilde h^{(j)}|^k$ is uniformly integrable. This result is useful in establishing the convergence of the moments of h̃(j) (see Corollary 5.3).

Corollary 5.2

Under Conditions (A1)–(A5) and (A8), taking $\delta_{n_j} = \max(3^{1/3}, 3^{1/3}/C_1^{1/3})\log^{1/3} n_j$, where C1 is defined in Theorem 5.1, h̃(j) and ĥ(j) are tail equivalent. If $S = o(n^3)$, then $\sum_{j=1}^S \tilde h^{(j)}$ and $\sum_{j=1}^S \hat h^{(j)}$ are also tail equivalent.

Tail equivalence of (j) and ĥ(j) follows by

$$\Pr(\tilde h^{(j)} \neq \hat h^{(j)}) = \Pr(\|\hat h^{(j)}\|_2 > \delta_{n_j}) \le \frac{C_2}{n_j^3} + 2\exp(-C_0 n_j) \le \frac{C_2\, \bar c^3}{n^3} + 2\exp\left(-\frac{C_0 n}{\bar c}\right), \tag{24}$$

where the first inequality is implied by Theorem 5.1 and the last inequality is due to Condition (A8). The second assertion follows by an application of Bonferroni’s inequality.

Corollary 5.2 proves (7). From now on, we take $\delta_{n_j} = \max(3^{1/3}, 3^{1/3}/C_1^{1/3})\log^{1/3} n_j$. By (24), Slutsky's theorem implies $\tilde h^{(j)} \xrightarrow{d} h_0$. Applying Skorohod's representation theorem (cf. Section 9.4 in Athreya and Lahiri, 2006), there exist random vectors $h^{(j)*} \stackrel{d}{=} \tilde h^{(j)}$ and $h_0^* \stackrel{d}{=} h_0$ such that $h^{(j)*} \to h_0^*$ almost surely. This together with the uniform integrability of $\|\tilde h^{(j)}\|_2^k$ gives the following corollary.

Corollary 5.3

Under Conditions (A1)–(A5), for any a ∈ ℝd and integer k ≥ 1, we have $E\{(a^T\tilde h^{(j)})^k\} \to E\{(a^T h_0)^k\}$ as nj → ∞.

Remark 5.3

Due to the i.i.d. assumption on the $X_i^{(j)}$'s, $E\{(a^T\tilde h^{(j)})^k\}$ is a function of nj only. Under Condition (A8), Corollary 5.3 implies

$$\sup_j \left| E\{(a^T\tilde h^{(j)})^k\} - E\{(a^T h_0)^k\}\right| \to 0, \qquad \text{as } n \to \infty.$$

Taking k = 2 proves (9); taking k = 3 proves (10). Moreover, Corollary 5.3 suggests a simple scheme for estimating the covariance matrix A ≡ cov(h0) in (5). For any vector a, by the law of large numbers, we obtain

$$\frac{1}{S}\sum_{j=1}^S (a^T\tilde h^{(j)})^2 - \frac{1}{S}\sum_{j=1}^S E(a^T\tilde h^{(j)})^2 \xrightarrow{a.s.} 0.$$

This, together with the tail equivalence between h̃(j) and ĥ(j) and with (9), implies that $\sum_j (a^T\hat h^{(j)})^2/S$ converges to $a^T A a$.
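
In matrix form, this suggests the following plug-in estimate of A from equal-size subgroups (our own sketch; θ0 is replaced by the aggregated estimate, which is a further approximation):

```python
import numpy as np

def estimate_A(subgroup_estimates, n):
    """Estimate A = cov(h_0) by the sample covariance of the rescaled
    subgroup estimators h_j = n^{1/3} (theta_j - theta_bar)."""
    est = np.asarray(subgroup_estimates, dtype=float)   # shape (S, d)
    h = n ** (1.0 / 3.0) * (est - est.mean(axis=0))     # centered, rescaled
    return h.T @ h / est.shape[0]                       # (1/S) sum h_j h_j^T
```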

6 Discussion

In this paper, we provide a general inference framework for aggregated M-estimators with cubic rates obtained by the divide and conquer method. Our results demonstrate that the aggregated estimators have a faster convergence rate than the original M-estimators based on pooled data and achieve asymptotic normality when the number of groups S does not grow too fast with respect to n, the average sample size of each group.

6.1 Rate of the bias

For a general cubic-rate estimator with sample size n, we showed that its bias can be bounded by $O((n/\log n)^{-5/12})$. In comparison, Banerjee et al. (2016) obtained a sharper bound in the specific setting of monotone regression and showed that the bias of the isotonic estimator can be bounded by $O(n^{-7/15+\zeta})$ for any ζ > 0, and the bias of its inverse by $o(n^{-1/2})$ (see Theorems 4.3 and 4.4 in that paper). As commented before, this is because we work in a more general setting, and their techniques cannot be easily generalized to other cubic-rate M-estimation problems.

However, it is possible to sharpen the bound in some special cases. In particular, when the parameter is one-dimensional, we show in Theorem A.1 (see also Corollary A.1) in the supplementary appendix that the bias of the estimator can be bounded by $O(n^{-5/9}\log^{9/14} n)$, based on the KMT approximation. Note that this bound is even sharper than those in Theorems 4.3 and 4.4 of Banerjee et al. (2016). This is because we impose a stronger assumption on the Lipschitz continuity of the Hessian matrix (see Assumption (A6) and Equation (4.3) in Banerjee et al., 2016). Under Assumption (A8), Theorem A.1 implies that asymptotic normality holds for the aggregated estimator as long as the number of machines satisfies $S = O(N^l)$ for some l < 4/13, where N is the total number of observations. Again, this upper bound on S may still be conservative; however, it improves substantially on Theorem 2.1. We further apply our theorem to the location estimator (see Section A.5.1) and the one-dimensional value search estimator (see Section A.5.2) for illustration.

For the bias of a general cubic-rate M-estimator, our proof relies on the Gaussian approximation of the suprema of empirical processes (cf. Chernozhukov et al., 2013, 2014) and the Sudakov-Fernique type error bound (Chatterjee, 2005). The proofs for these theorems are based on smooth approximation of the supremum function. It remains unknown whether the rates of these error bounds are optimal and whether they can be improved using other techniques. This is an interesting problem that needs further investigation.

6.2 The super-efficiency phenomenon

In the context of isotonic regression, Banerjee et al. (2016) showed that the faster convergence rate of the aggregated estimator of the inverse function for a fixed model comes at a price, that is, the maximal risk over a class of models in a neighborhood of the given model remains bounded for the pooled estimator but diverges to infinity for the aggregated estimator (see Theorem 6.1 in Banerjee et al., 2016). This is referred to as the super-efficiency phenomenon, which is seen in nonparametric function estimation as well (cf. Brown et al., 1997).

We believe such a super-efficiency phenomenon holds for many other cubic-rate M-estimation problems as well. In the supplementary appendix, we mathematically formalize the notion of the super-efficiency phenomenon for general M-estimation problems, and establish this phenomenon for the location estimator (see Section B.1) and the value search estimator (see Section B.2). The super-efficiency phenomenon essentially arises because the maximal bias of the aggregated estimator over a large class of models diverges to infinity. We suspect this is because the condition on the Lipschitz continuity of the Hessian matrix (Assumption (A6)) cannot hold uniformly over all models in such a class. We discuss this in detail in the supplementary appendix.

6.3 Other issues

In the current setup, we assume that all $X_i^{(j)}$'s are independently and identically distributed. It would be interesting to generalize Theorem 2.1 to the setting where the $X_i^{(j)}$'s are independent but not identically distributed. However, the meaning of the aggregated estimator may become unclear in some applications, such as the value search estimator, and the derivation of the asymptotic properties of the resulting aggregated estimator becomes much more involved. This needs further investigation.

Supplementary Material

supp

Acknowledgments

The authors would like to thank an associate editor and three referees for their thoughtful and constructive comments, which helped improve an earlier version of the paper. This work was partly supported by NIH grant P01 CA142538.

References

  1. Athreya KB, Lahiri SN. Measure Theory and Probability Theory. Springer Texts in Statistics. New York: Springer; 2006.
  2. Banerjee M, Durot C, Sen B. Divide and conquer in non-standard problems and the super-efficiency phenomenon. 2016. arXiv preprint arXiv:1605.04446.
  3. Brown LD, Low MG, Zhao LH. Superefficiency in nonparametric function estimation. Ann Statist. 1997;25(6):2607–2625.
  4. Chatterjee S. An error bound in the Sudakov–Fernique inequality. 2005. arXiv preprint math/0510424.
  5. Chen X, Xie M-g. A split-and-conquer approach for analysis of extraordinarily large data. Statist Sinica. 2014;24(4):1655–1684.
  6. Chernoff H. Estimation of the mode. Ann Inst Statist Math. 1964;16:31–41.
  7. Chernozhukov V, Chetverikov D, Kato K. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. Ann Statist. 2013;41(6):2786–2819.
  8. Chernozhukov V, Chetverikov D, Kato K. Gaussian approximation of suprema of empirical processes. Ann Statist. 2014;42(4):1564–1597.
  9. Csörgő M, Csörgő S, Horváth L, Révész P. On weak and strong approximations of the quantile process. In: Proceedings of the Seventh Conference on Probability Theory (Braşov, 1982). Utrecht: VNU Sci. Press; 1985. pp. 81–95.
  10. Durot C. Sharp asymptotics for isotonic regression. Probab Theory Related Fields. 2002;122(2):222–240.
  11. Geyer CJ. On the asymptotics of constrained M-estimation. Ann Statist. 1994;22(4):1993–2010.
  12. Groeneboom P, Hooghiemstra G, Lopuhaä HP. Asymptotic normality of the L1 error of the Grenander estimator. Ann Statist. 1999;27(4):1316–1347.
  13. Kim J, Pollard D. Cube root asymptotics. Ann Statist. 1990;18(1):191–219.
  14. Kleiner A, Talwalkar A, Sarkar P, Jordan MI. A scalable bootstrap for massive data. J R Stat Soc Ser B Stat Methodol. 2014;76(4):795–816.
  15. Koltchinskii VI. Komlos–Major–Tusnady approximation for the general empirical process and Haar expansions of classes of functions. J Theoret Probab. 1994;7(1):73–118.
  16. Komlós J, Major P, Tusnády G. An approximation of partial sums of independent RV's and the sample DF. I. Z Wahrscheinlichkeitstheorie und Verw Gebiete. 1975;32:111–131.
  17. Kosorok MR. Introduction to Empirical Processes and Semiparametric Inference. Springer Series in Statistics. New York: Springer; 2008.
  18. Rio E. Local invariance principles and their application to density estimation. Probab Theory Related Fields. 1994;98(1):21–45.
  19. Sakhanenko AI. Estimates in the invariance principle in terms of truncated power moments. Sibirsk Mat Zh. 2006;47(6):1355–1371.
  20. Splawa-Neyman J. On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statist Sci. 1990;5(4):465–472.
  21. van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. With Applications to Statistics. Springer Series in Statistics. New York: Springer-Verlag; 1996.
  22. Zhang B, Tsiatis AA, Laber EB, Davidian M. A robust method for estimating optimal treatment regimes. Biometrics. 2012;68(4):1010–1018. doi: 10.1111/j.1541-0420.2012.01763.x.
  23. Zhang Y, Duchi J, Wainwright M. Divide and conquer kernel ridge regression. In: Conference on Learning Theory; 2013. pp. 592–617.
  24. Zhao T, Cheng G, Liu H. A partially linear framework for massive heterogeneous data. Ann Statist. 2016;44(4):1400–1437. doi: 10.1214/15-AOS1410.
