Published in final edited form as: J Am Stat Assoc. 2018 Jun 19;113(524):1698–1709. doi: 10.1080/01621459.2017.1360779

A Massive Data Framework for M-Estimators with Cubic-Rate

Chengchun Shi, Wenbin Lu, and Rui Song

Abstract

The divide and conquer method is a common strategy for handling massive data. In this article, we study the divide and conquer method for cubic-rate estimators under the massive data framework. We develop a general theory for establishing the asymptotic distribution of the aggregated M-estimators using a weighted average with weights depending on the subgroup sample sizes. Under certain conditions on the growth rate of the number of subgroups, the resulting aggregated estimators are shown to have a faster convergence rate and an asymptotically normal distribution, which make them more tractable in both computation and inference than the original M-estimators based on pooled data. Our theory applies to a wide class of M-estimators with cube root convergence rate, including the location estimator, the maximum score estimator and the value search estimator. Simulations and a real data application further validate our theoretical findings.

Keywords: Cubic rate asymptotics, divide and conquer, M-estimators, massive data

1 Introduction

In a world of explosively large data, effective estimation procedures are needed to address the computational challenges arising from the analysis of massive data. The divide and conquer method is a commonly used approach for handling massive data: it divides the data into several groups, computes an estimator within each group, and aggregates the subgroup estimators, often by a simple average, to lessen the computational burden. A number of problems have been studied for the divide and conquer method, including variable selection (Chen and Xie, 2014), nonparametric regression (Zhang et al., 2013; Zhao et al., 2016) and bootstrap inference (Kleiner et al., 2014), to mention a few. Most papers establish that the aggregated estimators achieve the oracle result, in the sense that they possess the same nonasymptotic error bounds or limiting distributions as the pooled estimators, which are obtained by fitting a single model to all the data. This implies that the divide and conquer scheme not only maintains efficiency, but also yields a feasible solution for analyzing massive data.

In addition to its computational advantages for handling massive data, the divide and conquer method, somewhat surprisingly, can lead to aggregated estimators with improved efficiency over pooled estimators whose convergence rate is slower than the usual $n^{1/2}$. A recent independent work of Banerjee et al. (2016) studied the divide and conquer principle in the monotone regression setting, where the estimator converges at an $n^{1/3}$ rate. In particular, they showed that the aggregated estimator obtained by averaging all subgroup estimators converges much faster than the pooled estimator based on all observations and is asymptotically normal. This phenomenon is expected to hold in many other cube-root estimation problems. For example, Chernoff (1964) studied a cubic-rate estimator of the mode. It was shown therein that the estimator converges in distribution to the argmax of a Brownian motion minus a quadratic drift. Kim and Pollard (1990) systematically studied a class of cubic-rate M-estimators and established their limiting distributions as the argmax of a general Gaussian process minus a quadratic form. These results were extended to a more general class of M-estimators using modern empirical process results (van der Vaart and Wellner, 1996; Kosorok, 2008). In this paper, we study a class of M-estimators with cubic rate and develop a general inference framework for the aggregated estimators obtained by the divide and conquer method. Our theory states that the aggregated estimators can achieve a faster convergence rate than the pooled estimators and have asymptotically normal distributions when the number of groups diverges at a proper rate as the sample size of each group grows. This also yields a simple way of estimating the covariance matrix of the aggregated estimators.

When establishing the asymptotic properties of the aggregated estimators, a major technical challenge is to quantify the accumulated bias. Unlike estimators with the standard $n^{1/2}$ convergence rate, M-estimators with $n^{1/3}$ convergence rate generally do not have a nice linearization representation, and the magnitude of the associated biases is difficult to quantify. One way to obtain the magnitude of the bias is by establishing a coupling inequality for the cubic-rate estimator. For example, Banerjee et al. (2016) derived a nonasymptotic bound for the biases of the isotonic estimator in a monotone regression model and its inverse, based on the coupling inequality of the isotonic estimator (see Lemma 8.10 in that paper, and also Equation (29) in Durot (2002)). Groeneboom et al. (1999) provided a coupling inequality for the inverse process of the Grenander estimator, and their results can be used to establish the bias of the Grenander estimator. While such a strategy is useful for studying the bias of some one-dimensional cubic-rate estimators, it is not suitable for multi-dimensional estimators. On the one hand, these coupling inequalities are all based on the Komlós–Major–Tusnády (KMT) approximation (Komlós et al., 1975) and its extensions (cf. Csörgő et al., 1985; Sakhanenko, 2006), which only apply to the empirical distribution or the quantile process. There are extensions of the KMT approximation to more general empirical processes (cf. Rio, 1994; Koltchinskii, 1994); however, the rate of the approximation depends on the dimension of the parameter and deteriorates quickly as the dimension increases. On the other hand, proofs of these coupling inequalities all rely on the properties of the argmax of a Brownian motion process with a parabolic drift (cf. Proposition 1 in Durot (2002) and the discussions therein), and are not applicable to cubic-rate estimators that converge to the argmax of a more general Gaussian process minus a quadratic term. Here, we propose a novel approach to derive an upper bound for the bias without establishing coupling inequalities. To the best of our knowledge, this is the first time that a nonasymptotic error bound for the bias of a general cubic-rate estimator has been provided.

A key innovation in our analysis is to introduce a linear perturbation in the empirical objective function. In that way, we transform the problem of quantifying the bias into a comparison of the expected supremum of the empirical objective function with that of its limiting Gaussian process. To bound the difference of these expected suprema, we adopt techniques similar to those recently studied by Chernozhukov et al. (2013) and Chernozhukov et al. (2014). Specifically, they compared a function of the maximum of a sum of mean-zero Gaussian random vectors with that of multivariate mean-zero random vectors with the same covariance function, and provided an associated coupling inequality. We improve their arguments by providing more accurate approximation results (Lemma A.3) for the identity function of maxima, as needed in our applications.

Another major contribution of this paper is a tail inequality for cubic-rate M-estimators (Theorem 5.1). This allows us to construct a truncated estimator with bounded second moment, which is essential for applying Lyapunov's central limit theorem to establish the normality of the aggregated estimator. Under some additional tail assumptions on the underlying empirical process, our results can be viewed as a generalization of the empirical process theory that establishes consistency and the $n^{1/3}$ convergence rate of M-estimators. Based on these results, we show that the asymptotic variance of the aggregated estimator can be consistently estimated by the sample variance of the individual M-estimators in each group, which largely simplifies the inference procedure for M-estimators.

The rest of the paper is organized as follows. We describe the divide and conquer method for M-estimators and state the main central limit theorem (Theorem 2.1) in Section 2. Three examples, the location estimator, the maximum score estimator and the value search estimator, are presented in Section 3 to illustrate the application of Theorem 2.1. In Section 4, we demonstrate the empirical performance of the aggregated estimators using both simulation studies and an application to the Yahoo! Front Page Today Module user click log dataset. Section 5 establishes a tail inequality that is needed to prove Theorem 2.1, followed by a discussion in Section 6. All the technical proofs are provided in the Appendix.

2 Method

The divide and conquer scheme for M-estimators is described as follows. In the first step, the data are randomly divided into several groups. For the jth group, consider the following M-estimator

$$\hat\theta^{(j)} = \arg\max_{\theta\in\Theta} \mathbb{P}_{n_j}^{(j)} m(\cdot,\theta) \equiv \arg\max_{\theta\in\Theta} \frac{1}{n_j}\sum_{i=1}^{n_j} m(X_i^{(j)},\theta), \qquad j = 1,\dots,S,$$

where $(X_1^{(j)},\dots,X_{n_j}^{(j)})$ denote the data for the jth group, $n_j$ is the number of observations in the jth group, S is the number of groups, m(·, ·) is the objective function and θ is a d-dimensional vector of parameters that belongs to a compact parameter space Θ. In the second step, the aggregated estimator θ̂0 is obtained as a weighted average of all subgroup estimators,

$$\hat\theta_0 = \sum_{j=1}^S \omega_j\, \hat\theta^{(j)} = \frac{\sum_{j=1}^S n_j^{2/3}\,\hat\theta^{(j)}}{\sum_{j=1}^S n_j^{2/3}}. \tag{1}$$

Remark 2.1

The weights ωj's are chosen such that θ̂0 achieves the smallest asymptotic covariance matrix among the class of linearly aggregated estimators $\{\theta_\omega = \sum_j \omega_j \hat\theta^{(j)} : \sum_j \omega_j = 1,\ \omega_j \ge 0,\ \forall j = 1,\dots,S\}$ (see Section F in the supplementary appendix for detailed illustrations). When n1 = n2 = · · · = nS, θ̂0 reduces to a simple average of all θ̂(j)'s.
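
To make the aggregation step in (1) concrete, here is a minimal sketch (our own illustration; the function name is ours) that computes the weighted average of subgroup estimates with weights proportional to $n_j^{2/3}$:

```python
import numpy as np

def aggregate(subgroup_estimates, subgroup_sizes):
    """Aggregate subgroup M-estimators as in (1), with weights
    proportional to n_j^{2/3}.

    subgroup_estimates: array of shape (S, d), one row per group.
    subgroup_sizes:     array of shape (S,), the n_j's.
    """
    theta = np.asarray(subgroup_estimates, dtype=float)
    w = np.asarray(subgroup_sizes, dtype=float) ** (2.0 / 3.0)
    w /= w.sum()          # normalize so the weights sum to one
    return w @ theta      # weighted average across the S groups
```

With equal group sizes the weights reduce to 1/S, recovering the simple average noted in Remark 2.1.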

We assume that all the $X_i^{(j)}$'s are independent and identically distributed across i and j. Here, we only consider M-estimation with functions m(·, θ) that are non-smooth in θ, for which the resulting M-estimators θ̂(j) have a convergence rate of $O_p(n_j^{-1/3})$. Such cubic-rate M-estimators have been widely studied in the literature; examples include the location estimator and the maximum score estimator, as demonstrated in the next section. Define $N = \sum_j n_j$ and n = N/S. The main goal of this paper is to establish the convergence rate and asymptotic normality of θ̂0 under suitable conditions on S and the nj's.

Before introducing our main results, we first provide an intuitive explanation of why the divide and conquer method can improve efficiency in cubic-rate M-estimation problems. Assume for now that n1 = n2 = · · · = nS = n and that S is fixed. Following Kim and Pollard (1990), we can show that

$$n^{1/3}(\hat\theta^{(j)} - \theta_0) \xrightarrow{d} h_0, \qquad N^{1/3}(\tilde\theta_0 - \theta_0) \xrightarrow{d} h_0,$$

where θ̃0 is the pooled estimator, i.e., $\tilde\theta_0 = \arg\max_{\theta\in\Theta} \sum_{i,j} m(X_i^{(j)},\theta)$, θ0 is the unique maximizer of E{m(·, θ)}, and $h_0 = \arg\max_h Z(h)$ with

$$Z(h) = G(h) - \tfrac{1}{2}\, h^T V h. \tag{2}$$

Here G is a mean-zero Gaussian process and $V = -\partial^2 E\{m(\cdot,\theta)\}/\partial\theta\,\partial\theta^T\,|_{\theta=\theta_0}$ is a positive definite matrix.

Assume $\|N^{1/3}(\tilde\theta_0 - \theta_0)\|_2^2$ and $\|n^{1/3}(\hat\theta^{(j)} - \theta_0)\|_2^2$ are uniformly integrable. Then, we have

$$N^{2/3} E\{(\tilde\theta_0-\theta_0)(\tilde\theta_0-\theta_0)^T\} \to \mathrm{cov}(h_0), \quad n^{2/3} E\{(\hat\theta^{(j)}-\theta_0)(\hat\theta^{(j)}-\theta_0)^T\} \to \mathrm{cov}(h_0), \quad \text{as } N \to \infty. \tag{3}$$

Under equal allocation, the θ̂(j)'s are independent and identically distributed. We have

$$N^{2/3} E\{(\hat\theta_0-\theta_0)(\hat\theta_0-\theta_0)^T\} = \frac{N^{2/3}}{S^2}\sum_{j=1}^S E\{(\hat\theta^{(j)}-\theta_0)(\hat\theta^{(j)}-\theta_0)^T\} + \frac{N^{2/3}}{S^2}\sum_{j\neq k} E(\hat\theta^{(j)}-\theta_0)\, E(\hat\theta^{(k)}-\theta_0)^T$$
$$= \frac{n^{2/3}}{S^{1/3}}\, E\{(\hat\theta^{(1)}-\theta_0)(\hat\theta^{(1)}-\theta_0)^T\} + b_n b_n^T\, \frac{S^{2/3}(S-1)}{S} \;\to\; S^{-1/3}\,\mathrm{cov}(h_0), \tag{4}$$

where $b_n = n^{1/3} E(\hat\theta^{(j)} - \theta_0) = o(1)$ is the bias of $n^{1/3}\hat\theta^{(j)}$. Comparing (3) with (4), we can see that the aggregated estimator is more efficient than the pooled estimator in the fixed-S scenario.

Now let S grow with N. As long as S satisfies $S = O(1/\|b_n\|_2^2)$, we have

$$b_n b_n^T\, \frac{S^{2/3}(S-1)}{S} = O(S^{-1/3}),$$

and hence $N^{2/3} E\{(\hat\theta_0-\theta_0)(\hat\theta_0-\theta_0)^T\} = O(S^{-1/3})$. In view of (3), this implies that the aggregated estimator can achieve a faster convergence rate than the pooled estimator.

2.1 Main results

We assume the dimension d is fixed, while the number of groups S →∞ as N →∞. Let || · ||2 denote the Euclidean norm for vectors or induced matrix L2 norm for matrices. We first introduce some conditions.

  • (A1) There exists a small neighborhood Nδ = {θ : ‖θ − θ0‖2 ≤ δ} in which E{m(·, θ)} is twice continuously differentiable with Hessian matrix −V(θ), where V(θ) is positive definite in Nδ. Moreover, assume $E\{m(\cdot,\theta_0)\} > \sup_{\theta \in N_\delta^c} E\{m(\cdot,\theta)\}$.

  • (A2) For any θ1, θ2 ∈ Nδ, we have E{|m(·, θ1) − m(·, θ2)|²} ≤ K‖θ1 − θ2‖2 for a constant K that is independent of θ1 and θ2.

  • (A3) There exists some positive constant ω such that |m(x, θ)| ≤ ω for all x and θ.

  • (A4) The envelope function $M_R(\cdot) \equiv \sup_\theta\{|m(\cdot,\theta) - m(\cdot,\theta_0)| : \|\theta-\theta_0\|_2 \le R\}$ satisfies $E(M_R^2) = O(R)$ when R ≤ δ.

  • (A5) The set of functions {m(·, θ) : θ ∈ Θ} has Vapnik–Chervonenkis (VC) index 1 ≤ v < ∞.

  • (A6) For any θ ∈ Nδ, ‖V(θ) − V‖2 = O(‖θ − θ0‖2), where V = V(θ0).

  • (A7) Let L(·) denote the variance process of G(·), satisfying L(h) > 0 whenever h ≠ 0. (i) The function L(·) is symmetric and continuous, and has the rescaling property L(kh) = kL(h) for k > 0. (ii) For any h1, h2 ∈ ℝd satisfying ‖h1‖2 ≤ n^{1/3}δ and ‖h2‖2 ≤ n^{1/3}δ, we have

$$\left| L(h_1 - h_2) - n^{1/3} E\{m(\cdot,\theta_0 + n^{-1/3}h_1) - m(\cdot,\theta_0 + n^{-1/3}h_2)\}^2 \right| = O\left(\frac{(\|h_1\|_2 + \|h_2\|_2)^2}{n^{1/3}}\right).$$

  • (A8) Let cj = nj/n. Assume there exists some constant $\bar c > 1$ such that $1/\bar c \le c_j \le \bar c$ for all j.

Theorem 2.1

Under Conditions (A1)–(A8), if $S = o(n^{1/6}/\log^{5/6} n)$ and S → ∞ as n → ∞, we have

$$\sqrt{c_1^{2/3} + \cdots + c_S^{2/3}}\;\, n^{1/3}\, (\hat\theta_0 - \theta_0) \xrightarrow{d} N(0, A), \tag{5}$$

for some positive definite matrix A.

Remark 2.2

Under Condition (A8), Theorem 2.1 implies that θ̂0 converges at a rate of $O_p(S^{-1/2} n^{-1/3})$. In contrast, the original M-estimator based on the pooled data has a convergence rate of $O_p(S^{-1/3} n^{-1/3})$. This implies that we can gain efficiency by adopting the divide and conquer scheme for cubic-rate M-estimators. This result is interesting, as most aggregated estimators in the divide and conquer literature share the same convergence rates as the original estimators based on pooled data.

Remark 2.3

The constraint on S means that the number of groups cannot diverge too fast. The main reason, as shown in the proof of Theorem 2.1, is that if S grows too fast, the asymptotic normality of θ̂0 fails due to the accumulation of bias in the aggregation of subgroup estimators. Given a dataset of size N, we can take $S \asymp N^l$ and $n = N/S \asymp N^{1-l}$ with l < 1/7 to fulfill this requirement. It turns out that this requirement on S can be relaxed in some special cases. In particular, when d = 1, i.e., θ0 is a scalar, we show in the supplementary appendix that the aggregated estimator is asymptotically normal as long as $S \asymp N^l$ with l < 4/13. Details can be found in Section A.5 of the supplementary appendix.

Remark 2.4

Conditions (A1)–(A5) and (A7)(i) are similar to those in Kim and Pollard (1990) and are used to establish the cubic-rate convergence of the M-estimator in each group. Conditions (A6) and (A7)(ii) are used to establish the normality of the aggregated estimator. In particular, Condition (A7)(ii) implies that the Gaussian process G(·) has stationary increments, i.e., E[{G(h1) − G(h2)}²] = L(h1 − h2) for any h1, h2 ∈ ℝd, which is used to control the bias of the aggregated estimator. Condition (A8) automatically holds when n1 = · · · = nS.

In the rest of this section, we give a sketch of the proof of Theorem 2.1. The details of the proof are given in Section 5 and Section A in the supplementary appendix. Let $\hat h^{(j)} = n_j^{1/3}(\hat\theta^{(j)} - \theta_0)$. By definition, it is equivalent to show

$$\frac{1}{\sqrt{c_1^{2/3} + \cdots + c_S^{2/3}}}\, \sum_{j=1}^S c_j^{1/3}\, \hat h^{(j)} \xrightarrow{d} N(0, A). \tag{6}$$

When S diverges, intuitively, (6) follows from a direct application of the central limit theorem for triangular arrays (cf. Theorem 11.1.1, Athreya and Lahiri, 2006). However, a few challenges remain. First, the estimator ĥ(j) may not possess a finite second moment. Analogous to Kolmogorov's three-series theorem (cf. Theorem 8.3.5, Athreya and Lahiri, 2006), we handle this by first defining h̃(j), a truncated version of ĥ(j) with $\|\tilde h^{(j)}\|_2 \le \delta_{n_j}$ for some $\delta_{n_j} > 0$, such that $\sum_j \hat h^{(j)}$ and $\sum_j \tilde h^{(j)}$ are tail equivalent, i.e.,

$$\lim_{k\to\infty} \Pr\left(\bigcap_{n \ge k}\left\{\sum_{j=1}^{S(n)} c_j^{1/3}\, \hat h^{(j)} = \sum_{j=1}^{S(n)} c_j^{1/3}\, \tilde h^{(j)}\right\}\right) = 1.$$

Using the Borel–Cantelli lemma, it suffices to show

$$\sum_n \Pr\left(\sum_{j=1}^{S(n)} c_j^{1/3}\, \hat h^{(j)} \neq \sum_{j=1}^{S(n)} c_j^{1/3}\, \tilde h^{(j)}\right) < \infty. \tag{7}$$

Now it remains to show

$$\frac{1}{\sqrt{\sum_j c_j^{2/3}}}\, \sum_{j=1}^S c_j^{1/3}\, \tilde h^{(j)} = \frac{1}{\sqrt{\sum_j c_j^{2/3}}}\, \sum_{j=1}^S c_j^{1/3}\{\tilde h^{(j)} - E(\tilde h^{(j)})\} + \frac{1}{\sqrt{\sum_j c_j^{2/3}}}\, \sum_j c_j^{1/3}\, E(\tilde h^{(j)}) \xrightarrow{d} N(0, A).$$

The second challenge is to control the accumulated bias in the aggregated estimator, i.e. showing

$$\frac{1}{\sqrt{\sum_j c_j^{2/3}}}\, \sum_j c_j^{1/3}\, E(\tilde h^{(j)}) \to 0,$$

or

$$\sqrt{S}\, \sup_j \|E(\tilde h^{(j)})\|_2 \to 0, \tag{8}$$

by Assumption (A8). Finally, it remains to show that the second and third moments of h̃(j) satisfy

$$\sup_j \left| E(a^T \tilde h^{(j)})^2 - a^T A a \right| \to 0, \tag{9}$$
$$\sup_j E\|\tilde h^{(j)}\|_2^3 < \infty, \tag{10}$$

for any a ∈ ℝd. Once (7), (8), (9) and (10) are established, Theorem 2.1 follows from Lyapunov's central limit theorem (cf. Corollary 11.1.4, Athreya and Lahiri, 2006). Section 5 is devoted to verifying (7), (9) and (10), while Section A in the supplementary appendix is devoted to proving (8).

3 Applications

In this section, we illustrate our main theorem (Theorem 2.1) with three applications: the simple one-dimensional location estimator (Example 3.1) and two more complicated multi-dimensional estimators with constraints, namely the maximum score estimator (Example 3.2) and the value search estimator (Example 3.3).

3.1 Location estimator

Let $X_i^{(j)}$ (i = 1, …, n; j = 1, …, S) be i.i.d. random variables on the real line with a continuous density p. In each subgroup j, consider the location estimator

$$\hat\theta^{(j)} = \arg\max_\theta \frac{1}{n}\sum_{i=1}^n I(\theta - 1 \le X_i^{(j)} \le \theta + 1).$$

It was shown in Example 3.2.13 of van der Vaart and Wellner (1996) and Example 6.1 of Kim and Pollard (1990) that each θ̂(j) has cubic-rate convergence. We assume that Pr(X ∈ [θ − 1, θ + 1]) has a unique maximizer at θ0. When the derivative of p exists and is continuous, p′(θ0 − 1) − p′(θ0 + 1) > 0 implies that the second derivative of Pr(X ∈ [θ − 1, θ + 1]) is negative for all θ within some small neighborhood Nδ around θ0. Therefore, Condition (A2) holds, since

$$E|I(\theta_1 - 1 \le X \le \theta_1 + 1) - I(\theta_2 - 1 \le X \le \theta_2 + 1)|^2 = \Pr(\theta_1 - 1 \le X \le \theta_2 - 1) + \Pr(\theta_1 + 1 \le X \le \theta_2 + 1) \le \sup_{\theta \in N_\delta}\{p(\theta - 1) + p(\theta + 1)\}\,|\theta_1 - \theta_2|,$$

for θ1 ≤ θ2 and |θ1 − θ2| < 0.5. Moreover, if we further assume that p has a continuous second derivative in the neighborhood Nδ, Condition (A6) is satisfied.

The class of functions {I(θ − 1 ≤ X ≤ θ + 1) : θ ∈ Θ} is bounded by 1 and belongs to the VC class. In addition, we have

$$\sup_{|\theta - \theta_0| < \varepsilon} |I(\theta - 1 \le X \le \theta + 1) - I(\theta_0 - 1 \le X \le \theta_0 + 1)| \le I(\theta_0 - 1 - \varepsilon \le X \le \theta_0 - 1 + \varepsilon) + I(\theta_0 + 1 - \varepsilon \le X \le \theta_0 + 1 + \varepsilon),$$

for small ε. The squared L2(P) norm of the function on the right-hand side is O(ε). Hence, Conditions (A4) and (A5) hold.

Next, we claim that Condition (A7) holds with L(h) ≡ 2p(θ0 + 1)|h|, or equivalently {p(θ0 − 1) + p(θ0 + 1)}|h|, since p(θ0 − 1) = p(θ0 + 1) by the first-order condition at θ0. Obviously, L(·) is symmetric and satisfies the rescaling property. For any h1, h2 such that max(|h1|, |h2|) ≤ n^{1/3}δ, we define θ1 = θ0 + n^{−1/3}h1 ∈ Nδ and θ2 = θ0 + n^{−1/3}h2 ∈ Nδ. Let [a, b] denote the indicator function I(a ≤ X ≤ b). Assume h1 ≤ h2. We have

$$n^{1/3} E\big|[\theta_1 - 1, \theta_1 + 1] - [\theta_2 - 1, \theta_2 + 1]\big|^2 = n^{1/3} E[\theta_1 - 1, \theta_2 - 1] + n^{1/3} E[\theta_1 + 1, \theta_2 + 1] = n^{1/3}\int_{\theta_1 - 1}^{\theta_2 - 1} p(\theta)\, d\theta + n^{1/3}\int_{\theta_1 + 1}^{\theta_2 + 1} p(\theta)\, d\theta = \{p(\theta_0 + 1) + p(\theta_0 - 1)\}(h_2 - h_1) + R,$$

where the remainder term R is bounded by

$$\sup_{\theta_1 \le \theta \le \theta_2}\big(|p(\theta - 1) - p(\theta_0 - 1)| + |p(\theta + 1) - p(\theta_0 + 1)|\big)(h_2 - h_1) \le 4 n^{-1/3} \sup_{\theta \in N_\delta} |p'(\theta)|\, (h_2 - h_1)\max(|h_1|, |h_2|) \le 4 n^{-1/3} \sup_{\theta \in N_\delta} |p'(\theta)|\, (|h_1| + |h_2|)^2,$$

using a first-order Taylor expansion. The case h1 > h2 can be handled similarly. Therefore, Condition (A7) holds, and Theorem 2.1 then follows.
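
As a toy illustration of this example (our own sketch, with ad hoc sample sizes and grid resolution), each subgroup estimator can be computed by a grid search, since the objective is a step function of θ, and the subgroup estimates are then averaged:

```python
import numpy as np

def location_estimator(x, grid):
    """Maximize sum_i I(theta - 1 <= X_i <= theta + 1) over a grid of theta."""
    xs = np.sort(x)
    # number of observations falling in [t - 1, t + 1] for each grid point t
    counts = (np.searchsorted(xs, grid + 1, side="right")
              - np.searchsorted(xs, grid - 1, side="left"))
    return grid[np.argmax(counts)]

rng = np.random.default_rng(0)
S, n = 64, 4096                        # number of groups and group size
grid = np.linspace(-1.0, 1.0, 2001)    # search grid for theta
est = np.array([location_estimator(rng.standard_normal(n), grid)
                for _ in range(S)])
theta_hat0 = est.mean()                # equal sizes: simple average
print(theta_hat0)                      # close to theta_0 = 0
```

For the standard normal density, θ0 = 0 and the aggregated estimate concentrates around it at the faster $S^{-1/2} n^{-1/3}$ rate predicted by Theorem 2.1.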

3.2 Maximum score estimator

Consider the regression model $Y_i^{(j)} = X_i^{(j)T}\beta_0 + e_i^{(j)}$, i = 1, …, n, j = 1, …, S, where $X_i^{(j)}$ is a d-dimensional vector of covariates and $e_i^{(j)}$ is the random error. Assume that the $(X_i^{(j)}, e_i^{(j)})$'s are i.i.d. copies of (X, e). The maximum score estimator is defined as

$$\hat\beta^{(j)} = \arg\max_{\|\beta\|_2 = 1} \sum_{i=1}^n \left\{ I(Y_i^{(j)} \ge 0,\, X_i^{(j)T}\beta \ge 0) + I(Y_i^{(j)} < 0,\, X_i^{(j)T}\beta < 0) \right\},$$

where the constraint ||β||2 = 1 is to guarantee the uniqueness of the maximizer.

Assume ‖β0‖2 = 1; otherwise we can define β* = β0/‖β0‖2 and establish the asymptotic distribution of β̂0 − β* instead. It was shown in Example 6.4 of Kim and Pollard (1990) that β̂(j) has cubic-rate convergence when (i) median(e|X) = 0; (ii) X has a bounded, continuously differentiable density p; and (iii) the angular component of X has a bounded continuous density with respect to the surface measure on 𝒮^{d−1}, the unit sphere in ℝd.

Theorem 2.1 is not directly applicable to this example since Assumption (A1) is violated. The Hessian matrix

$$V = -\left.\frac{\partial^2 E\{I(Y_i^{(j)} \ge 0,\, X_i^{(j)T}\beta \ge 0) + I(Y_i^{(j)} < 0,\, X_i^{(j)T}\beta < 0)\}}{\partial\beta\, \partial\beta^T}\right|_{\beta_0}$$

is not positive definite. One possible solution is to use arguments from the constrained M-estimation literature (e.g., Geyer, 1994) to approximate the set ‖β‖2 = 1 by the hyperplane (β − β0)^Tβ0 = 0, and obtain a version of Theorem 2.1 for constrained cubic-rate M-estimators. We adopt an alternative approach here and consider a simple reparameterization that makes Theorem 2.1 applicable.

By Gram–Schmidt orthogonalization, we can obtain an orthogonal matrix [β0, U0], where $U_0 \in \mathbb{R}^{d\times(d-1)}$ satisfies $U_0^T\beta_0 = 0$. Define

$$\beta(\theta) = \sqrt{1 - \|\theta\|_2^2}\;\beta_0 + U_0\theta, \tag{11}$$

for all θ ∈ ℝ^{d−1} with ‖θ‖2 ≤ 1. Take Θ to be the unit ball $B_2^{d-1}$ in ℝ^{d−1}. Define

$$\hat\theta^{(j)} = \arg\max_{\theta\in\Theta} \sum_{i=1}^n \left[ I(Y_i^{(j)} \ge 0,\, X_i^{(j)T}\beta(\theta) \ge 0) + I(Y_i^{(j)} < 0,\, X_i^{(j)T}\beta(\theta) < 0) \right].$$

Note that under the assumption median(e|X) = 0, we have θ0 = 0.
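
The reparameterization is easy to carry out numerically. The following sketch (our own illustration; the orthonormal complement U0 is obtained via a QR decomposition rather than explicit Gram–Schmidt, and the optimizer is left unspecified) builds the map (11) and the subgroup objective:

```python
import numpy as np

def basis_orthogonal_to(beta0):
    """Return U0 of shape (d, d-1) with orthonormal columns and U0^T beta0 = 0."""
    d = beta0.size
    # QR of [beta0, e_1, ..., e_{d-1}]: the first column of Q is beta0 (up to
    # sign); the remaining columns are an orthonormal basis of its complement.
    q, _ = np.linalg.qr(np.column_stack([beta0, np.eye(d)[:, : d - 1]]))
    return q[:, 1:]

def beta_of_theta(theta, beta0, U0):
    """Map theta in the unit ball of R^{d-1} to the unit sphere, as in (11)."""
    return np.sqrt(1.0 - theta @ theta) * beta0 + U0 @ theta

def score_objective(theta, X, Y, beta0, U0):
    """Maximum score objective evaluated at beta(theta)."""
    xb = X @ beta_of_theta(theta, beta0, U0)
    return np.sum((Y >= 0) & (xb >= 0)) + np.sum((Y < 0) & (xb < 0))
```

Since the objective is piecewise constant in θ, a grid or direct search over the unit ball is a natural way to locate the argmax in each group.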

Let m(y, x, β) = I(y ≥ 0, xT β ≥ 0) + I(y < 0, xTβ < 0). Define

$$\kappa(x) = E\{I(e + X^T\beta_0 \ge 0) - I(e + X^T\beta_0 < 0) \mid X = x\}.$$

It is shown in Kim and Pollard (1990) that

$$\frac{\partial E\{m(\cdot,\cdot,\beta)\}}{\partial\beta} = \|\beta\|_2^{-2}\,\beta^T\beta_0\,\big(I - \|\beta\|_2^{-2}\beta\beta^T\big) \int_{x^T\beta_0 = 0} \kappa(T_\beta x)\, p(T_\beta x)\, x\, d\sigma, \tag{12}$$

where

$$T_\beta = (I - \|\beta\|_2^{-2}\beta\beta^T)(I - \beta_0\beta_0^T) + \|\beta\|_2^{-1}\beta\beta_0^T,$$

and σ is the surface measure on the hyperplane x^Tβ0 = 0.

Note that ∂β(θ)/∂θ has finite derivatives of all orders as long as ‖θ‖2 < 1. Assume that κ and p are twice continuously differentiable. This together with (12) implies that E{m(·, ·, β(θ))} has continuous third derivatives as a function of θ in a small neighborhood Nδ (δ < 1) around 0. This verifies (A6). Moreover, for any θ1, θ2 ∈ Nδ with ‖θ1 − θ2‖2 ≤ ε, we have

$$\|\beta(\theta_1) - \beta(\theta_2)\|_2^2 = \|\theta_1 - \theta_2\|_2^2 + \left(\sqrt{1 - \|\theta_1\|_2^2} - \sqrt{1 - \|\theta_2\|_2^2}\right)^2 = \|\theta_1 - \theta_2\|_2^2 + \frac{(\|\theta_2\|_2^2 - \|\theta_1\|_2^2)^2}{\left(\sqrt{1 - \|\theta_1\|_2^2} + \sqrt{1 - \|\theta_2\|_2^2}\right)^2} \le \frac{2\|\theta_1 - \theta_2\|_2^2}{1 - \delta^2}. \tag{13}$$

Kim and Pollard (1990) showed that E{|m(·, ·, β1) −m(·, ·, β2)|} = O(||β1β2||2) near β0. This together with (13) implies

$$E\{|m(\cdot,\cdot,\beta(\theta_1)) - m(\cdot,\cdot,\beta(\theta_2))|^2\} \le 2\, E\{|m(\cdot,\cdot,\beta(\theta_1)) - m(\cdot,\cdot,\beta(\theta_2))|\} = O(\|\theta_1 - \theta_2\|_2).$$

Therefore, (A2) is satisfied and (A3) trivially holds since |m| ≤ 1.

It was also shown in Kim and Pollard (1990) that the envelope Mε of the class of functions {m(·, ·, β) − m(·, ·, β0) : ‖β − β0‖2 ≤ ε} satisfies $E(M_\varepsilon^2) = O(\varepsilon)$. Using (13), we can show that the envelope M̃ε of the class of functions {m(·, ·, β(θ)) − m(·, ·, β0) : ‖θ‖2 ≤ ε} also satisfies $E(\tilde M_\varepsilon^2) = O(\varepsilon)$. Thus, (A4) is satisfied. Moreover, since the class of functions m(·, ·, β) over all β belongs to the VC class, so does the class of functions m(·, ·, β(θ)). This verifies (A5).

Finally, we establish (A7). For any θ1, θ2 ∈ Nδ, define $h_1 = n^{1/3}\theta_1$ and $h_2 = n^{1/3}\theta_2$. We have

$$n^{1/3} E\{|m(Y, X, \beta(h_1/n^{1/3})) - m(Y, X, \beta(h_2/n^{1/3}))|^2\} = n^{1/3} E\{|I(X^T\beta(h_1/n^{1/3}) \ge 0) - I(X^T\beta(h_2/n^{1/3}) \ge 0)|\, I(Y \ge 0)\} + n^{1/3} E\{|I(X^T\beta(h_1/n^{1/3}) < 0) - I(X^T\beta(h_2/n^{1/3}) < 0)|\, I(Y < 0)\} = n^{1/3} E\{|I(X^T\beta(h_1/n^{1/3}) \ge 0) - I(X^T\beta(h_2/n^{1/3}) \ge 0)|\}. \tag{14}$$

We write X as rβ0 + z with z orthogonal to β0. Equation (14) can then be written as

$$n^{1/3} E\left\{\left| I\left(r\sqrt{1 - \left\|\frac{h_1}{n^{1/3}}\right\|_2^2} + z^T U\frac{h_1}{n^{1/3}} \ge 0\right) - I\left(r\sqrt{1 - \left\|\frac{h_2}{n^{1/3}}\right\|_2^2} + z^T U\frac{h_2}{n^{1/3}} \ge 0\right) \right|\right\}. \tag{15}$$

Define $\omega = n^{1/3} r$. Equation (15) can be expressed as

$$\int\!\!\int I\left(-z^T U h_1 (1 - n^{-2/3}\|h_1\|_2^2)^{-1/2} > \omega \ge -z^T U h_2 (1 - n^{-2/3}\|h_2\|_2^2)^{-1/2}\right) p(n^{-1/3}\omega, z)\, d\omega\, dz.$$

Assume that p(r, z) is differentiable with respect to r and |∂p(r, z)/∂r| ≤ q(z) for some function q. Then, (15) is equal to

$$\int \left| z^T U\left\{h_1 (1 - n^{-2/3}\|h_1\|_2^2)^{-1/2} - h_2 (1 - n^{-2/3}\|h_2\|_2^2)^{-1/2}\right\}\right| p(0, z)\, dz + R_1 = \int |z^T U (h_1 - h_2)|\, p(0, z)\, dz + R_1 + R_2,$$

where the remainders |R1| and |R2| are bounded by

$$|R_1| \le n^{-1/3}\int \left\{(z^T U h_1)^2 + (z^T U h_2)^2\right\} q(z)\, dz = O\left(n^{-1/3}\{\|h_1\|_2^2 + \|h_2\|_2^2\}\right),$$

and

$$|R_2| \le \left|(1 - n^{-2/3}\|h_1\|_2^2)^{-1/2} - 1\right| \int |z^T U h_1|\, p(0, z)\, dz + \left|(1 - n^{-2/3}\|h_2\|_2^2)^{-1/2} - 1\right| \int |z^T U h_2|\, p(0, z)\, dz \le n^{-1/3}(\|h_1\|_2 + \|h_2\|_2)\int (|z^T U h_1| + |z^T U h_2|)\, p(0, z)\, dz = O\left(n^{-1/3}\{\|h_1\|_2^2 + \|h_2\|_2^2\}\right),$$

under suitable moment assumptions on functions p(0, z) and q(z). This verifies (A7).

An application of Theorem 2.1 implies

$$\frac{1}{\sqrt{S}}\sum_{j=1}^S n^{1/3}\,\hat\theta^{(j)} \xrightarrow{d} N(0, A),$$

for some positive definite matrix $A \in \mathbb{R}^{(d-1)\times(d-1)}$. Hence

$$\frac{1}{\sqrt{S}}\sum_{j=1}^S n^{1/3}\, U\hat\theta^{(j)} \xrightarrow{d} N(0, UAU^T). \tag{16}$$

By the definition of θ̂(j) and β̂(j), we have

$$\left\|\frac{1}{\sqrt{S}}\sum_{j=1}^S n^{1/3}\left(\hat\beta^{(j)} - \beta_0 - U\hat\theta^{(j)}\right)\right\|_2 = \left|\frac{1}{\sqrt{S}}\sum_{j=1}^S n^{1/3}\left(\sqrt{1 - \|\hat\theta^{(j)}\|_2^2} - 1\right)\right| = \left|\frac{1}{\sqrt{S}}\sum_{j=1}^S n^{1/3}\,\frac{1 - \|\hat\theta^{(j)}\|_2^2 - 1}{\sqrt{1 - \|\hat\theta^{(j)}\|_2^2} + 1}\right| \le \frac{n^{1/3}}{\sqrt{S}}\sum_j \|\hat\theta^{(j)}\|_2^2.$$

With probability at least 1 − S/n → 1, the last expression is $O(\sqrt{S}\, n^{1/3}\cdot n^{-2/3}\log^{2/3} n) = o(1)$, which is implied by the tail inequality for θ̂(j) established in Theorem 5.1. Combining this with (16), we have

$$\frac{1}{\sqrt{S}}\sum_{j=1}^S n^{1/3}(\hat\beta^{(j)} - \beta_0) \xrightarrow{d} N(0, UAU^T).$$

3.3 Value search estimator

The value search estimator was introduced by Zhang et al. (2012) for estimating the optimal treatment regime. The data can be summarized as i.i.d. triples $\{O_i^{(j)} = (X_i^{(j)}, A_i^{(j)}, Y_i^{(j)}) : i = 1,\dots,n;\ j = 1,\dots,S\}$, where $X_i^{(j)} \in \mathbb{R}^d$ denotes the patient's baseline covariates, $A_i^{(j)}$ is the treatment received by the patient, taking the value 0 or 1, and $Y_i^{(j)}$ is the response, the larger the better by convention. Consider the following model

$$Y_i^{(j)} = \mu(X_i^{(j)}) + A_i^{(j)}\, C(X_i^{(j)}) + e_i^{(j)}, \tag{17}$$

where μ(·) is the baseline mean function, C(·) is the contrast function, and $e_i^{(j)}$ is the random error with $E\{e_i^{(j)} \mid A_i^{(j)}, X_i^{(j)}\} = 0$. The optimal treatment regime is defined in the potential outcome framework. Specifically, let $Y_i^{(j)}(0)$ and $Y_i^{(j)}(1)$ be the potential outcomes that would be observed if the patient received treatment 0 or 1, respectively. For a treatment regime d that maps $X_i^{(j)}$ to {0, 1}, define the potential outcome

$$Y_i^{(j)}(d) = d(X_i^{(j)})\, Y_i^{(j)}(1) + \{1 - d(X_i^{(j)})\}\, Y_i^{(j)}(0).$$

The optimal regime dopt is defined as the rule that maximizes the expected potential outcome, i.e., the value function $E\{Y_i^{(j)}(d)\}$. Under the stable unit treatment value assumption (SUTVA) and the no unmeasured confounders assumption (Splawa-Neyman, 1990), the optimal treatment regime under model (17) is given by dopt(x) = I{C(x) > 0}.

The true contrast function C(·) can be complex. As suggested by Zhang et al. (2012), in practice we can find the restricted optimal regime within a class of decision rules, such as the linear treatment decision rules d(x, β) = I(β1 + x1β2 + · · · + xdβd+1 > 0) indexed by β ∈ ℝ^{d+1}, where the subscript k denotes the kth element of the vector. Let β* = argmax_β V(β), where $V(\beta) = E[Y_i^{(j)}\{d(X_i^{(j)}, \beta)\}]$. To make β* identifiable, we assume $\beta_1^* = -1$. Define $\theta^* = (\beta_2^*, \dots, \beta_{d+1}^*)^T$. The restricted optimal treatment regime is then given by d̃(x, θ*) = I(x^Tθ* > 1), and the value function is $V(\theta) = E[Y_i^{(j)}\{\tilde d(X_i^{(j)}, \theta)\}]$ with d̃(x, θ) = I(x^Tθ > 1). Zhang et al. (2012) proposed an inverse propensity score weighted estimator of the value function V(θ) and the associated value search estimator obtained by maximizing the estimated value function. Specifically, for each group j, the value search estimator is defined as

$$\hat\theta^{(j)} = \arg\max_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n \frac{\tilde d(X_i^{(j)}, \theta)\, A_i^{(j)} + \{1 - \tilde d(X_i^{(j)}, \theta)\}(1 - A_i^{(j)})}{\pi_i^{(j)} A_i^{(j)} + (1 - \pi_i^{(j)})(1 - A_i^{(j)})}\; Y_i^{(j)}, \tag{18}$$

where $\pi_i^{(j)} = \Pr(A_i^{(j)} = 1 \mid X_i^{(j)})$ is the propensity score, which is known in a randomized study. Here, for illustration purposes, we assume that the $\pi_i^{(j)}$'s are known.
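
The criterion in (18) is straightforward to evaluate. Here is a minimal sketch (our own code; the optimization over θ is left to a user-supplied grid or derivative-free search, since the objective is piecewise constant in θ):

```python
import numpy as np

def ipw_value(theta, X, A, Y, pi):
    """Inverse propensity score weighted value estimate of the linear rule
    d(x, theta) = I(x^T theta > 1), as in (18)."""
    d = (X @ theta > 1).astype(float)
    followed = d * A + (1.0 - d) * (1.0 - A)   # I(received arm matches rule)
    prob = pi * A + (1.0 - pi) * (1.0 - A)     # probability of the received arm
    return np.mean(followed / prob * Y)

# Example for a scalar theta: maximize over a grid.
# grid = np.linspace(0.0, 5.0, 1001)
# theta_hat = grid[np.argmax([ipw_value(np.array([t]), X, A, Y, pi)
#                             for t in grid])]
```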

Define $m(O_i^{(j)}, \theta) = \xi_i^{(j)}\, \tilde d(X_i^{(j)}, \theta)$, where

$$\xi_i^{(j)} = \frac{A_i^{(j)}}{\pi_i^{(j)}}\, C(X_i^{(j)}) + \frac{A_i^{(j)} - \pi_i^{(j)}}{\pi_i^{(j)}(1 - \pi_i^{(j)})}\,\{\mu(X_i^{(j)}) + e_i^{(j)}\} = \left(\frac{A_i^{(j)}}{\pi_i^{(j)}} - \frac{1 - A_i^{(j)}}{1 - \pi_i^{(j)}}\right) Y_i^{(j)}.$$

With some algebra, we can show that θ̂(j) also maximizes $\mathbb{P}_n^{(j)} m(\cdot, \theta)$, where $\mathbb{P}_n^{(j)}$ is the empirical measure for the data in group j. Unlike the previous two examples, here the function m is not bounded. To fulfill (A3), we need $\|\xi_i^{(j)}\|_{\psi_1} < \infty$. This holds when $0 < \gamma_1 < \pi_i^{(j)} < \gamma_2 < 1$ for some constants γ1 and γ2, $\|C(X_i^{(j)})\|_{\psi_1} < \infty$, $\|\mu(X_i^{(j)})\|_{\psi_1} < \infty$ and $\|e_i^{(j)}\|_{\psi_1} < \infty$.

To show (A1) and (A6), we evaluate the integral

$$\Gamma(\theta) = E\{\xi\, \tilde d(X, \theta)\} = E\{C(X)\, \tilde d(X, \theta)\} = \int_{x^T\theta > 1} C(x)\, p(x)\, dx, \tag{19}$$

where p(x) is the density function of $X_i^{(j)}$. Consider the transformation

$$T_\theta = (I - \|\theta\|_2^{-2}\theta\theta^T) + \|\theta\|_2^{-2}\theta(\theta^*)^T,$$

which maps the region {x^Tθ* > 1} onto {x^Tθ > 1}, and {x^Tθ* = 1} onto {x^Tθ = 1}. We exclude the trivial case θ* = 0. The above definition is meaningful when θ is taken over a small neighborhood Nδ of θ*. We assume that the functions p and C are continuously differentiable. Note that

$$\frac{\partial T_\theta x}{\partial\theta} = -\frac{\theta^T x - (\theta^*)^T x}{\|\theta\|_2^2}\, I - \frac{\theta x^T}{\|\theta\|_2^2} + \frac{2\theta\theta^T\{x^T\theta - x^T\theta^*\}}{\|\theta\|_2^4}.$$

Using differential geometry arguments similar to those in Section 5 of Kim and Pollard (1990), a change of variables x ↦ T_θx shows that the integral (19) can be represented as

$$\Gamma(\theta) = \int_{x^T\theta^* > 1} C(T_\theta x)\, p(T_\theta x)\, \frac{\theta^T\theta^*}{\|\theta\|_2^2}\, dx,$$

which is thrice differentiable under certain conditions on C(x), p(x) and their derivatives.

To show (A7), we assume that the conditional density p(x|y) of X given Y = 1 − X^Tθ* exists and is continuously differentiable with respect to y. Similarly, assume that the density q(y) of Y exists and is continuously differentiable. Let g(X) = E(ξ²|X). For any h1, h2 ∈ ℝd, we have

$$n^{1/3} E\left\{\xi^2 \left| I(X^T\theta^* + n^{-1/3}X^T h_1 > 1) - I(X^T\theta^* + n^{-1/3}X^T h_2 > 1)\right|^2\right\} = n^{1/3}\int g(x) \left| I(n^{-1/3}x^T h_1 > y) - I(n^{-1/3}x^T h_2 > y)\right| p(x \mid y)\, q(y)\, dx\, dy.$$

Let $y = n^{-1/3} z$. The last expression in the above equation can be written as

$$\int g(x) \left| I(x^T h_1 > z) - I(x^T h_2 > z)\right| p(x \mid 0)\, q(0)\, dx\, dz + R = \int g(x) \left| x^T(h_1 - h_2)\right| p(x \mid 0)\, q(0)\, dx + R,$$

with the remainder term

$$R = \int g(x) \left| I(x^T h_1 > z) - I(x^T h_2 > z)\right| \left\{p(x \mid n^{-1/3} z)\, q(n^{-1/3} z) - p(x \mid 0)\, q(0)\right\} dx\, dz,$$

which is $O(n^{-1/3}(\|h_1\|_2^2 + \|h_2\|_2^2))$ under certain conditions on q(·) and p(x|·). Conditions (A2) and (A4) can be verified similarly. Since the class of functions {g(x)I(x^Tθ > 1) : θ ∈ ℝd} has finite VC index, Condition (A5) also holds. Theorem 2.1 then follows.

4 Numerical studies

In this section, we examine the numerical performance of the aggregated M-estimator for the three examples studied in the previous section and compare it with the M-estimator based on pooled data, denoted as the pooled estimator.

4.1 Location estimator

The data Xj (j = 1, …, N) were independently generated from the standard normal distribution. The true parameter θ0 that maximizes E{I(θ − 1 ≤ Xjθ + 1)} was set to be 0. Let θ̃0 and θ̂0 denote the pooled estimator and the aggregated estimator, respectively. To obtain θ̂0, we randomly divided the data into S blocks with equal size n = N/S.

We took $N = 2^i$ for i = 14, 16, 18, 20, and chose $S = 2^j$ such that 0.2 ≤ j/i < 0.625 when $N = 2^i$. For each combination of N and S, we estimated the standard error of θ̂0 by

$$\widehat{\mathrm{SE}}(\hat\theta_0) = \frac{1}{\sqrt{S}}\left\{\frac{1}{S - 1}\sum_{l=1}^S (\hat\theta^{(l)} - \hat\theta_0)^2\right\}^{1/2},$$

where θ̂(l) denotes the M-estimator for the lth group. For each scenario, we conducted 1000 simulation replications and plot the coverage probabilities of 95% predictive intervals in Figure 1. We also report the bias and sample standard deviation (denoted as SD) of the estimators θ̃0 and θ̂0, the mean of the estimated standard errors, and the coverage probability (denoted as CP) of the Wald-type 95% confidence interval for θ̂0 in Table 1 of the supplementary appendix, for some of the scenarios where $N = 2^i$ for i = 14, 16, 18, 20, and $S = 2^j$ for j = 4, 5, 6, 7. Unlike θ̂0, θ̃0 does not converge to a tractable limiting distribution and does not have a convenient variance estimator. Hence, in Table 1, we did not provide standard errors or confidence intervals for θ̃0.
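
A sketch of this variance estimate and the resulting Wald-type interval (our own code, written for a scalar parameter and equal group sizes):

```python
import numpy as np

def wald_ci(subgroup_estimates, z=1.96):
    """Aggregate scalar subgroup M-estimators (equal group sizes) and form
    a Wald-type confidence interval from the across-group sample variance."""
    est = np.asarray(subgroup_estimates, dtype=float)
    S = est.size
    theta_hat0 = est.mean()                 # simple average over groups
    se = est.std(ddof=1) / np.sqrt(S)       # SE-hat in the display above
    return theta_hat0, (theta_hat0 - z * se, theta_hat0 + z * se)
```

This sidesteps the nonstandard limiting distribution of a single cubic-rate estimator: the across-group variability supplies the variance estimate directly.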

Figure 1. Coverage probability of the 95% predictive interval with different choices of N and S, for the location estimator.

From Figure 1, it is clear that for this specific application, the coverage probabilities are approximately 95% when S ≤ S*, where $S^* \approx N^{0.55}$. In this example, the cubic-rate estimator is one-dimensional, and according to Theorem A.1 and the discussion in Section A.5.1 in the supplementary appendix, the aggregated estimator is asymptotically normal when $S = O(N^l)$ for 0 < l < 4/13. This is consistent with our numerical findings. Based on the results in Table 1, it can be seen that the aggregated estimator θ̂0 has a much smaller standard deviation than the pooled estimator θ̃0, indicating the efficiency gain from the divide and conquer scheme shown in our theory. In addition, the bias of θ̂0 generally increases and the standard deviation of θ̂0 generally decreases as S and N increase, and the normal approximation becomes more accurate as S increases. This demonstrates the bias-variance trade-off for aggregated estimators. With properly chosen S, the estimated standard error of θ̂0 is close to its standard deviation and the coverage probability is close to the nominal level.

4.2 Maximum score estimator

Consider the model $Y_i = 1.5X_{i1} - 1.5X_{i2} + 0.5e_i$, i = 1, …, N, where $X_{i1}$, $X_{i2}$ and $e_i$ were generated independently from the standard normal distribution. Hence, θ0 = (θ1, θ2)^T = (1.5, −1.5)^T. Let θ̃0 = (θ̃1, θ̃2)^T denote the pooled estimator and θ̂0 = (θ̂1, θ̂2)^T the aggregated estimator. We set $N = 2^{20}, 2^{22}$ and $S = 2^j$ such that 0.18 ≤ j/i ≤ 0.42 when $N = 2^i$. The coverage probabilities of the 95% confidence intervals for θ̂1 and θ̂2 are plotted in Figure 2 based on 1000 replications. They are close to the nominal level when S ≤ S*, where $S^* \approx N^{0.32}$. This example can also be regarded as a one-dimensional cubic-rate estimation problem, since θ̂1 and θ̂2 satisfy the constraint $\hat\theta_1^2 + \hat\theta_2^2 = 1$. Therefore, similar to the discussion in Section A.5.2, we can show that θ̂1 and θ̂2 are asymptotically normal when $S = O(N^l)$ for 0 < l < 4/13. This upper bound is close to S*, since 4/13 ≈ 0.308. Other results are given in Table 2 of the supplementary appendix. The findings are very similar to those for the location estimator in the previous example.

Figure 2. Coverage probability of the 95% predictive interval with different choices of N and S, for the maximum score estimator.

4.3 Value search estimator

Consider the model $Y_i = 1 + A_i(2X_i - 1) + e_i$, i = 1, …, N, where $X_i \sim N(0, 1)$, $e_i \sim N(0, 0.25)$, and $\Pr(A_i = 1) = 0.5$. Under this model assumption, the optimal treatment rule takes the form

dopt(x)=I(2x>1),

and hence θ* = 2.

We take $N = 2^{24}, 2^{25}, 2^{26}$ and $2^{27}$. When $N = 2^{24}$ and $2^{25}$, we choose $S = 2^j$ for j = 4, 5, 6, 7. When $N = 2^{26}$ and $2^{27}$, we choose $S = 2^j$ for j = 5, 6, 7, 8. This gives a total of 16 scenarios. We plot the coverage probabilities of the 95% predictive intervals for θ̂0 in Figure 3 for these combinations of S and N. When $S \le S^* \approx N^{0.27}$, the coverage probabilities are close to 95%. This is also a one-dimensional problem. Note that in this application, the rate 0.27 in the practical upper bound is slightly smaller than the theoretical upper bound 4/13 ≈ 0.308. However, the theoretical upper bound is only determined up to a scaling constant. When N becomes larger, the ratio log S*/log N should be close to or larger than 0.308. Details about the bias and the sample standard deviations of the aggregated estimator are given in Table 3 of the supplementary appendix.

Figure 3. Coverage probability of the 95% predictive interval with different choices of N and S, for the value search estimator.

4.4 Yahoo! Today Module user click log dataset

Online content recommendation services have received extensive attention in both the machine learning and statistics literature. These online services strive to recommend advertisements or news articles to individual users by making use of both content and user information. In this subsection, we apply the proposed method to a Yahoo! Today Module user click log dataset, which contains 45,811,883 user visits to the Today Module during the first ten days of May 2009. Given such a large number of observations, it is extremely difficult to analyze the entire dataset on a single computer, which makes the divide and conquer method a natural tool for handling it.

For the ith visit, the dataset contains a binary response variable Yi, the ID of the recommended article and a 6-dimensional feature vector of the user. Due to sensitivity and privacy concerns, feature definitions and article names were not included in the data. Here, Yi = 1 means the user clicked the recommended article and Yi = 0 means the user did not click. The last element in the feature vector is always 1, and the first five sum to 1. Therefore, we took the first three and the fifth elements of the feature vector to form the covariates Xi. For illustration, we only consider a subset of the data that contains visits on May 1st where the recommended article ID is either 109510 or 109520. There were a total of 50 candidate articles on May 1st; we chose these two articles since they were recommended most often on that day. This gives us a total of 405,888 visits. On the reduced dataset, define Ai = 1 if the recommended article is 109510 and Ai = 0 otherwise. In this example, the online recommendation problem can be formulated as follows. Denote by 𝒟 a given set of functions that map the covariate space to the space of article IDs. Our aim is to find the optimal recommendation strategy in 𝒟 to maximize the users' click-through rate. We consider estimating the optimal recommendation rule among the set of linear decision functions 𝒟 = {I(x^Tθ > 1) : θ ∈ ℝ⁴}. Hence, estimating the optimal recommendation strategy is similar to the problem of estimating the optimal treatment regime described in Section 3.3. Specifically, we divide the data randomly into S pieces, $\{(X_i^{(j)}, A_i^{(j)}, Y_i^{(j)}) : i = 1,\dots,n_j\}_{j=1,\dots,S}$, and obtain

$$\hat\theta^{(j)} = \arg\max_{\theta\in\mathbb{R}^4} \frac{1}{n_j}\sum_{i=1}^{n_j}\left\{\left(\frac{A_i^{(j)}}{\hat\pi_i^{(j)}}\, I(\theta^T X_i^{(j)} > 1) + \frac{1 - A_i^{(j)}}{1 - \hat\pi_i^{(j)}}\, I(\theta^T X_i^{(j)} \le 1)\right) Y_i^{(j)} - \left(\frac{A_i^{(j)}}{\hat\pi_i^{(j)}}\, I(\theta^T X_i^{(j)} > 1) + \frac{1 - A_i^{(j)}}{1 - \hat\pi_i^{(j)}}\, I(\theta^T X_i^{(j)} \le 1) - 1\right)\left\{\hat h_{0i}^{(j)}\, I(\theta^T X_i^{(j)} \le 1) + \hat h_{1i}^{(j)}\, I(\theta^T X_i^{(j)} > 1)\right\}\right\}, \tag{20}$$

as the subgroup estimator, where $\hat\pi_i^{(j)}$, $\hat h_{0i}^{(j)}$ and $\hat h_{1i}^{(j)}$ are estimators of $\Pr(A_i^{(j)} = 1 \mid X_i^{(j)})$, $\Pr(Y_i^{(j)} = 1 \mid A_i^{(j)} = 0, X_i^{(j)})$ and $\Pr(Y_i^{(j)} = 1 \mid A_i^{(j)} = 1, X_i^{(j)})$, respectively, obtained by logistic regressions. We chose the nj's such that $\max_j n_j - \min_j n_j \le 1$. The estimated optimal recommendation strategy is given by $I(x^T\hat\theta_0 > 1)$, where $\hat\theta_0 = \sum_j \hat\theta^{(j)}/S$.

Remark 4.1

Compared to the value search estimator defined in (18), here we obtain the subgroup estimator by maximizing an augmented version of the inverse propensity score weighted estimator. The resulting estimator also converges at the $n^{-1/3}$ rate but is more efficient than the original one in (18).
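
A minimal sketch of the augmented criterion (our own code; the sign convention of the augmentation term follows the standard augmented IPW construction, and the fitted nuisance quantities pi_hat, h0_hat and h1_hat are assumed to be supplied, e.g. from logistic regressions):

```python
import numpy as np

def aipw_value(theta, X, A, Y, pi_hat, h0_hat, h1_hat):
    """Augmented IPW value estimate of the rule I(x^T theta > 1), cf. (20).

    pi_hat: estimated propensities Pr(A = 1 | X)
    h0_hat: estimated Pr(Y = 1 | A = 0, X);  h1_hat: same with A = 1
    """
    d = (X @ theta > 1).astype(float)
    w = d * A / pi_hat + (1.0 - d) * (1.0 - A) / (1.0 - pi_hat)
    m = d * h1_hat + (1.0 - d) * h0_hat     # outcome model under the rule
    return np.mean(w * Y - (w - 1.0) * m)   # augmentation term reduces variance
```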

Due to a data confidentiality agreement, we are not able to use the raw data. Here, we generate pseudo responses $\tilde Y_i^{(j)}$ given $X_i^{(j)}$ and $A_i^{(j)}$ from the Yahoo data, and use the dataset $\{(X_i^{(j)}, A_i^{(j)}, \tilde Y_i^{(j)}) : i = 1,\dots,n_j,\ j = 1,\dots,S\}$ in our application. The generated variables $\tilde Y_i^{(j)}$ are similar to the original responses $Y_i^{(j)}$. For example, we have $\sum_{i,j}\tilde Y_i^{(j)}/\sum_j n_j \approx 4.71\%$ while $\sum_{i,j} Y_i^{(j)}/\sum_j n_j \approx 4.73\%$. Moreover, under our data generating process, the population limit of θ̂(j) in (20) can be explicitly calculated as θ0 = (θ0,1, θ0,2, θ0,3, θ0,4)^T = (2.534, 2.881, 2.796, 3.200)^T for any j. Hence, θ0 is also the population limit of θ̂0 when S does not diverge too fast. Detailed descriptions of the generation of the $\tilde Y_i^{(j)}$'s are given in Section I of the supplementary appendix.

We choose $S = 2^j$ for j = 4, 5, …, 10. For a given S, denote by $\hat\theta_0(S) = (\hat\theta_{0,1}(S), \hat\theta_{0,2}(S), \hat\theta_{0,3}(S), \hat\theta_{0,4}(S))^T$ the corresponding aggregated estimator. For each S, we use the sample variance to estimate the variance of the aggregated estimator. Based on these estimates, we plot the estimates $\hat\theta_{0,i}(S)$ and the Wald-type 95% confidence intervals for θ0,i in Figure 4, for i = 1, …, 4 and different choices of S.

Figure 4. 95% confidence intervals of θ0,1, θ0,2, θ0,3 and θ0,4 (from top to bottom and from left to right), against log(S)/log(2). Dashed lines mark the corresponding θ0,i's.

It is clear from Figure 4 that the variance of $\hat\theta_0(S)$ decreases as S increases, since the width of the confidence intervals decreases with S. Moreover, when S is extremely large, some of the parameters are not covered by the 95% confidence intervals. For example, from the top left plot in Figure 4, θ0,1 is not covered by the confidence intervals of $\hat\theta_{0,1}(S)$ when $S = 2^9$ and $2^{10}$. This phenomenon is due to the large bias of $\hat\theta_0(S)$. These empirical results demonstrate the bias-variance trade-off for the aggregated estimator and are consistent with our theoretical findings.

5 Tail inequality for ĥ(j)

In this section, we establish tail inequalities for θ̂(j) and ĥ(j), which are used to construct h̃(j), a truncated version of ĥ(j) that is tail equivalent to it.

Theorem 5.1

Under Conditions (A1)-(A5), for sufficiently large nj, there exists some constant C0, such that

$$\Pr(\hat\theta^{(j)} \notin N_\delta) \le 2\exp(-C_0 n_j). \tag{21}$$

Moreover, for sufficiently large nj, there exist some constants C1, C2 > 0 and N0 ≥ 2, such that

$$\Pr\left(\|\hat h^{(j)}\|_2 > x \,\big|\, \hat\theta^{(j)} \in N_\delta\right) \le C_2 \exp(-C_1 x^3), \tag{22}$$

for any $N_0 \le x \le n_j^{1/3}\delta$.

Remark 5.1

Inequalities (21) and (22) can be viewed as generalizations of the consistency and rate-of-convergence results established for cube root estimators (cf. Corollary 4.2 in Kim and Pollard, 1990). The tail probability of ‖ĥ(j)‖2 is obtained based on the subexponential tail assumption (A3) on m(·, θ).

We represent ĥ(j) as

$$\hat h^{(j)} = \arg\max_{h \in H_{n_j}} M_{n_j,j}(h) \equiv \arg\max_{h \in H_{n_j}} \left\{n_j^{1/6}\, \mathbb{G}_{n_j}^{(j)}(m_h^{(j)}) + n_j^{2/3}\, E(m_h^{(j)})\right\},$$

where $H_{n_j} = \{h \in \mathbb{R}^d : n_j^{-1/3} h + \theta_0 \in \Theta\}$, $\mathbb{G}_{n_j}^{(j)} = n_j^{1/2}(\mathbb{P}_{n_j}^{(j)} - E)$ and $m_h^{(j)}(\cdot) = m(\cdot, \theta_0 + n_j^{-1/3} h) - m(\cdot, \theta_0)$. Similarly, define

$$\tilde h^{(j)} = \arg\max_{h \in H_{n_j} \cap H_{\delta_n}} M_{n_j,j}(h) = \arg\max_{h \in H_{n_j} \cap H_{\delta_n}} \left\{n_j^{1/6}\, \mathbb{G}_{n_j}^{(j)}(m_h^{(j)}) + n_j^{2/3}\, E(m_h^{(j)})\right\},$$

where $H_{\delta_n} = \{h : \|h\|_2 \le \delta_n\}$. By definition, $\|\tilde h^{(j)}\|_2 \le \delta_n$. The following corollaries are immediate applications of Theorem 5.1.

Corollary 5.1

Assume $\delta_n \le n_j^{1/3}\delta$. Under Conditions (A1)–(A5), for sufficiently large nj, there exist some constants N0 ≥ 2, C4 and C5, such that

$$\Pr(\|\tilde h^{(j)}\|_2 > x) \le C_5 \exp(-C_4 x^3), \qquad \forall x \ge N_0. \tag{23}$$

The proof is straightforward by noting that, for any $x \le n_j^{1/3}\delta$,

$$\Pr(\|\tilde h^{(j)}\|_2 > x) \le \Pr(\|\tilde h^{(j)}\|_2 > x \mid \hat\theta^{(j)} \in N_\delta)\Pr(\hat\theta^{(j)} \in N_\delta) + \Pr(\hat\theta^{(j)} \notin N_\delta) \le C_2\exp(-C_1 x^3) + 2\exp(-C_0 n_j) \le C_5\exp(-C_4 x^3).$$

Remark 5.2

Corollary 5.1 suggests that h̃(j) has finite moments of all orders. For any a ∈ ℝd and positive integer k, this implies that the sequence of random variables $|a^T\tilde h^{(j)}|^k$ is uniformly integrable. This result is useful in establishing the convergence of the moments of h̃(j) (see Corollary 5.3).

Corollary 5.2

Under Conditions (A1)–(A5) and (A8), taking $\delta_{n_j} = \max(3^{1/3}, 3^{1/3}/C_1^{1/3})\log^{1/3} n_j$, where C1 is defined in Theorem 5.1, h̃(j) and ĥ(j) are tail equivalent. If $S = o(n^3)$, then $\sum_{j=1}^S \tilde h^{(j)}$ and $\sum_{j=1}^S \hat h^{(j)}$ are also tail equivalent.

Tail equivalence of (j) and ĥ(j) follows by

$$\Pr(\tilde h^{(j)} \neq \hat h^{(j)}) = \Pr(\|\hat h^{(j)}\|_2 > \delta_{n_j}) \le \frac{C_2}{n_j^3} + 2\exp(-C_0 n_j) \le \frac{C_2\, \bar c^3}{n^3} + 2\exp\left(-\frac{C_0 n}{\bar c}\right), \tag{24}$$

where the first inequality is implied by Theorem 5.1 and the last inequality is due to Condition (A8). The second assertion follows by an application of Bonferroni’s inequality.

Corollary 5.2 proves (7). From now on, we take $\delta_{n_j} = \max(3^{1/3}, 3^{1/3}/C_1^{1/3})\log^{1/3} n_j$. By (24), Slutsky's theorem implies $\tilde h^{(j)} \xrightarrow{d} h_0$. Applying Skorohod's representation theorem (cf. Section 9.4 in Athreya and Lahiri, 2006), there exist random vectors $h^{(j)*} \stackrel{d}{=} \tilde h^{(j)}$ and $h_0^* \stackrel{d}{=} h_0$ such that $h^{(j)*} \to h_0^*$ almost surely. This together with the uniform integrability of $\|\tilde h^{(j)}\|_2^k$ gives the following corollary.

Corollary 5.3

Under Conditions (A1)–(A5), for any a ∈ ℝd and integer k ≥ 1, we have $E\{(a^T\tilde h^{(j)})^k\} \to E\{(a^T h_0)^k\}$ as nj → ∞.

Remark 5.3

Due to the i.i.d. assumption on the $X_i^{(j)}$'s, $E\{(a^T\tilde h^{(j)})^k\}$ is a function of nj only. Under Condition (A8), Corollary 5.3 implies

$$\sup_j \left| E\{(a^T\tilde h^{(j)})^k\} - E\{(a^T h_0)^k\}\right| \to 0, \qquad \text{as } n \to \infty.$$

Taking k = 2 proves (9); taking k = 3 proves (10). Moreover, Corollary 5.3 suggests a simple scheme for estimating the covariance matrix A ≡ cov(h0) in (5). For any vector a, by the law of large numbers, we obtain

$$\frac{1}{S}\sum_{j=1}^S (a^T\tilde h^{(j)})^2 - \frac{1}{S}\sum_{j=1}^S E(a^T\tilde h^{(j)})^2 \xrightarrow{a.s.} 0.$$

This, together with the tail equivalence between h̃(j) and ĥ(j) and with (9), implies that $\sum_j (a^T\hat h^{(j)})^2/S$ converges to $a^T A a$.
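
In matrix form, this suggests the following plug-in estimate of A from equal-size subgroups (our own sketch; θ0 is replaced by the aggregated estimate, which is a further approximation):

```python
import numpy as np

def estimate_A(subgroup_estimates, n):
    """Estimate A = cov(h_0) by the sample covariance of the rescaled
    subgroup estimators h_j = n^{1/3} (theta_j - theta_bar)."""
    est = np.asarray(subgroup_estimates, dtype=float)   # shape (S, d)
    h = n ** (1.0 / 3.0) * (est - est.mean(axis=0))     # centered, rescaled
    return h.T @ h / est.shape[0]                       # (1/S) sum h_j h_j^T
```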

6 Discussion

In this paper, we provide a general inference framework for aggregated M-estimators with cubic rates obtained by the divide and conquer method. Our results demonstrate that the aggregated estimators have a faster convergence rate than the original M-estimators based on pooled data and achieve asymptotic normality when the number of groups S does not grow too fast with respect to n, the average sample size of each group.

6.1 Rate of the bias

For a general cubic-rate estimator with sample size n, we showed that its bias can be bounded by $O((n/\log n)^{-5/12})$. In comparison, Banerjee et al. (2016) obtained a sharper bound in the specific setting of monotone regression and showed that the bias of the isotonic estimator can be bounded by $O(n^{-7/15+\zeta})$ for any ζ > 0, and the bias of its inverse by $o(n^{-1/2})$ (see Theorems 4.3 and 4.4 in that paper). As commented before, this is because we work in a more general setting, and their techniques cannot be easily generalized to other cubic-rate M-estimation problems.

However, it is possible to sharpen the bound in some special cases. In particular, when the parameter is one-dimensional, we show in Theorem A.1 (see also Corollary A.1) in the supplementary appendix that the bias of the estimator can be bounded by $O(n^{-5/9}\log^{9/14} n)$, based on the KMT approximation. Note that this bound is even sharper than those in Theorems 4.3 and 4.4 of Banerjee et al. (2016). This is because we impose a stronger assumption on the Lipschitz continuity of the Hessian matrix (see Assumption (A6) and Equation (4.3) in Banerjee et al., 2016). Under Assumption (A8), Theorem A.1 implies that asymptotic normality holds for the aggregated estimator as long as the number of machines satisfies $S = O(N^l)$ for some l < 4/13, where N is the total number of observations. Again, this upper bound on S may still be conservative; however, it improves substantially on Theorem 2.1. We further apply our theorem to the location estimator (see Section A.5.1) and the one-dimensional value search estimator (see Section A.5.2) for illustration.

For the bias of a general cubic-rate M-estimator, our proof relies on the Gaussian approximation of the suprema of empirical processes (cf. Chernozhukov et al., 2013, 2014) and the Sudakov-Fernique type error bound (Chatterjee, 2005). The proofs for these theorems are based on smooth approximation of the supremum function. It remains unknown whether the rates of these error bounds are optimal and whether they can be improved using other techniques. This is an interesting problem that needs further investigation.

6.2 The super-efficiency phenomenon

In the context of isotonic regression, Banerjee et al. (2016) showed that the faster convergence rate of the aggregated estimator of the inverse function for a fixed model comes at a price, that is, the maximal risk over a class of models in a neighborhood of the given model remains bounded for the pooled estimator but diverges to infinity for the aggregated estimator (see Theorem 6.1 in Banerjee et al., 2016). This is referred to as the super-efficiency phenomenon, which is seen in nonparametric function estimation as well (cf. Brown et al., 1997).

We believe such a super-efficiency phenomenon holds for many other cubic-rate M-estimation problems as well. In the supplementary appendix, we mathematically formalize the notion of the super-efficiency phenomenon for general M-estimation problems, and establish this phenomenon for the location estimator (see Section B.1) and the value search estimator (see Section B.2). The super-efficiency phenomenon essentially arises because the maximal bias of the aggregated estimator over a large class of models diverges to infinity. We suspect this is because the condition on the Lipschitz continuity of the Hessian matrix (Assumption (A6)) cannot hold uniformly over all models in such a class. We discuss this in detail in the supplementary appendix.

6.3 Other issues

In the current setup, we assume that all $X_i^{(j)}$'s are independently and identically distributed. It would be interesting to generalize Theorem 2.1 to the setting where the $X_i^{(j)}$'s are independent but not identically distributed. However, the meaning of the aggregated estimator may become unclear in some applications, such as the value search estimator, and the derivation of the asymptotic properties of the resulting aggregated estimator becomes much more involved. This needs further investigation.

Supplementary Material

supp

Acknowledgments

The authors would like to thank an associate editor and three referees for their thoughtful and constructive comments, which helped improve an earlier version of the paper. This work was partly supported by NIH grant P01 CA142538.

References

  1. Athreya KB, Lahiri SN. Measure Theory and Probability Theory. Springer Texts in Statistics. New York: Springer; 2006.
  2. Banerjee M, Durot C, Sen B. Divide and conquer in non-standard problems and the super-efficiency phenomenon. 2016. arXiv preprint arXiv:1605.04446.
  3. Brown LD, Low MG, Zhao LH. Superefficiency in nonparametric function estimation. Ann Statist. 1997;25(6):2607–2625.
  4. Chatterjee S. An error bound in the Sudakov–Fernique inequality. 2005. arXiv preprint math/0510424.
  5. Chen X, Xie M-g. A split-and-conquer approach for analysis of extraordinarily large data. Statist Sinica. 2014;24(4):1655–1684.
  6. Chernoff H. Estimation of the mode. Ann Inst Statist Math. 1964;16:31–41.
  7. Chernozhukov V, Chetverikov D, Kato K. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. Ann Statist. 2013;41(6):2786–2819.
  8. Chernozhukov V, Chetverikov D, Kato K. Gaussian approximation of suprema of empirical processes. Ann Statist. 2014;42(4):1564–1597.
  9. Csörgő M, Csörgő S, Horváth L, Révész P. On weak and strong approximations of the quantile process. In: Proceedings of the Seventh Conference on Probability Theory (Braşov, 1982). Utrecht: VNU Sci. Press; 1985. pp. 81–95.
  10. Durot C. Sharp asymptotics for isotonic regression. Probab Theory Related Fields. 2002;122(2):222–240.
  11. Geyer CJ. On the asymptotics of constrained M-estimation. Ann Statist. 1994;22(4):1993–2010.
  12. Groeneboom P, Hooghiemstra G, Lopuhaä HP. Asymptotic normality of the L1 error of the Grenander estimator. Ann Statist. 1999;27(4):1316–1347.
  13. Kim J, Pollard D. Cube root asymptotics. Ann Statist. 1990;18(1):191–219.
  14. Kleiner A, Talwalkar A, Sarkar P, Jordan MI. A scalable bootstrap for massive data. J R Stat Soc Ser B Stat Methodol. 2014;76(4):795–816.
  15. Koltchinskii VI. Komlos–Major–Tusnady approximation for the general empirical process and Haar expansions of classes of functions. J Theoret Probab. 1994;7(1):73–118.
  16. Komlós J, Major P, Tusnády G. An approximation of partial sums of independent RV's and the sample DF. I. Z Wahrscheinlichkeitstheorie und Verw Gebiete. 1975;32:111–131.
  17. Kosorok MR. Introduction to Empirical Processes and Semiparametric Inference. Springer Series in Statistics. New York: Springer; 2008.
  18. Rio E. Local invariance principles and their application to density estimation. Probab Theory Related Fields. 1994;98(1):21–45.
  19. Sakhanenko AI. Estimates in the invariance principle in terms of truncated power moments. Sibirsk Mat Zh. 2006;47(6):1355–1371.
  20. Splawa-Neyman J. On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statist Sci. 1990;5(4):465–472.
  21. van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. With Applications to Statistics. Springer Series in Statistics. New York: Springer-Verlag; 1996.
  22. Zhang B, Tsiatis AA, Laber EB, Davidian M. A robust method for estimating optimal treatment regimes. Biometrics. 2012;68(4):1010–1018. doi: 10.1111/j.1541-0420.2012.01763.x.
  23. Zhang Y, Duchi J, Wainwright M. Divide and conquer kernel ridge regression. In: Conference on Learning Theory; 2013. pp. 592–617.
  24. Zhao T, Cheng G, Liu H. A partially linear framework for massive heterogeneous data. Ann Statist. 2016;44(4):1400–1437. doi: 10.1214/15-AOS1410.
