Journal of Applied Statistics
2021 Apr 6; 50(3):555–573. doi: 10.1080/02664763.2021.1911967

A distributed multiple sample testing for massive data

Xie Xiaoyue, Jian Shi (corresponding author), Kai Song
PMCID: PMC10128098  PMID: 37114090

ABSTRACT

When data are stored in a distributed manner, directly applying traditional hypothesis testing procedures is often prohibitive due to communication costs and privacy concerns. This paper develops and investigates a distributed two-node Kolmogorov–Smirnov hypothesis testing scheme, implemented via the divide-and-conquer strategy. Based on the proposed testing scheme, it also provides a distributed fraud-detection method and a distribution-based classification method for multi-node machines. The fraud-detection method identifies which nodes in a multi-node machine store fraudulent data, and the distribution-based classification method determines whether the multi-node distributions differ and groups the nodes by distribution. These methods can improve the accuracy of statistical inference in a distributed storage architecture. Simulation and real-data studies verify the feasibility of the proposed methods.

Keywords: Distributed scheme, hypothesis testing, fraud detection, classification

1. Introduction

In recent years, statistical inference has faced tremendous challenges in computation and data storage. Datasets are often too large to store on a single machine, and many applications involve individual agents (local governments, research labs, hospitals, smartphones) collecting data independently. Communication between agents is prohibitively expensive due to limited bandwidth, and direct data sharing raises concerns about privacy and loss of ownership. These constraints make it necessary to develop methodologies for distributed systems. Distributed statistical inference has received considerable attention in modern distributed computing architectures. More specifically, it covers a wide spectrum of topics, including principal component analysis [9,10], M-estimation [2,3,15,20,27,34], nonparametric regression [11,25,36], Bayesian methods [13,28], confidence intervals [6,29], quantile regression [26], bootstrap [14], and so on.

However, when a distributed statistical inference method is applied to actual data, the intra-node and inter-node samples need to satisfy the basic independent and identically distributed assumption. If the actual dataset does not satisfy this assumption, the practical performance may be poor no matter how effective the inference method is in theory. Therefore, it is particularly important to verify this assumption in practical applications. The independence assumption is easily satisfied under the distributed storage architecture. For the identical distribution assumption, one needs to determine whether all node datasets come from the same population, that is, to test the equality of the inter-node distributions. When the sample size of each node is not large and all node samples can be easily accessed, we can adopt classical nonparametric hypothesis testing schemes to verify the identical distribution assumption. For example, when the number of nodes is k = 2, the two-sample Kolmogorov–Smirnov (KS) test can be used to determine whether the two underlying distributions differ. However, this classical nonparametric test is hard to conduct when each node dataset is extremely large or is unavailable due to privacy protection. For instance, observations of socio-economic indicators are in general distributed and stored across administrative regions. If we are interested in making national inferences on important indicators such as lifetime and happiness index, it is necessary to test whether samples in different regions come from the same distribution. But the conventional KS test is difficult to apply in this case because the dataset in each region is usually huge and can hardly be processed as a whole by a proxy. Therefore, this paper focuses on developing distributed KS testing schemes via the divide-and-conquer strategy to solve the two-node distribution testing problem.

Our proposed testing schemes, respectively, aggregate the test decision results and the p-values of the hypothesis tests in each data block to obtain a final decision: whether or not to reject the null hypothesis. In general, one needs to consider what weight should be used to combine the results from each data block; our proposed approaches, however, avoid this issue. Furthermore, based on the distributed testing schemes, we also provide fraud detection and distribution-based classification approaches for multi-node machines.

The rest of this paper is organized as follows. Section 2 proposes the distributed multiple sample testing – the distributed two-node KS testing schemes. In Section 3, we discuss the block size in the distributed testing schemes. Section 4 provides a fraud detection approach in the distributed architecture via the distributed two-node KS testing schemes. Section 5 discusses a classification method to test whether the multi-node distributions differ and to classify the different distributions. Section 6 applies the proposed testing schemes to real data. We conclude in Section 7. Detailed proofs are relegated to the appendix.

2. The distributed two-node KS testing schemes

2.1. Problem setup

When the samples stored on the two nodes are relatively small, we can compute the KS test statistic on a proxy easily and quickly. However, when the amount of data stored on the two nodes is huge, it can be prohibitive for the proxy to transmit or process the full data due to bandwidth or memory constraints. In this setting, the conventional KS test procedure is difficult to implement. Therefore, we propose distributed two-node KS testing schemes via the divide-and-conquer strategy: batch testing and then aggregation. The proposed testing schemes not only solve the problem that the conventional test cannot be directly used, but also improve computational efficiency, as shown in the later discussion. For simplicity, we assume the sample sizes of the two node machines are identical. For details on the two-sample KS test, see [1,18,19,23].

2.2. The testing schemes

There are N independent samples {X, Y} from continuous distribution functions F and G, stored on two node machines. We randomly and evenly partition these data into m subsets {X^{(i)}, Y^{(i)}}_{i=1}^m, each of sample size n, where N = m × n and n is manageable. Our goal is to test

H_0: F = G versus H_1: F ≠ G. (1)

At first, we conduct the two-sample KS test for each data block {X^{(i)}, Y^{(i)}}, i = 1, …, m. More specifically, the two-sided KS test statistic for the ith data block is

D_n^{(i)} = √(n/2) · sup_u |F_n^{(i)}(u) − G_n^{(i)}(u)|,

where F_n^{(i)}(u) and G_n^{(i)}(u) are the empirical distribution functions of X^{(i)} and Y^{(i)}, respectively. According to [24], we have

lim_{n→∞} P(D_n^{(i)} ≤ z | H_0) = K(z), z > 0,

where

K(z) = 1 − 2[e^{−2z^2} − e^{−2(2z)^2} + e^{−2(3z)^2} − ⋯] = 1 − 2 ∑_{k=1}^{∞} (−1)^{k−1} e^{−2k^2 z^2}.

Then we reject H_0 at significance level α for the ith data block if

D_n^{(i)} > K_{1−α},

where K_{1−α} is the (1−α)th quantile of the Kolmogorov distribution K(z).
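The block-wise test above can be sketched in a few lines, assuming SciPy is available. `ks_2samp` computes sup_u |F_n(u) − G_n(u)|, and `kstwobign` is the Kolmogorov distribution K(z); the data below are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp, kstwobign

rng = np.random.default_rng(0)
x_i = rng.normal(0.0, 1.0, size=1000)  # block i from node X (illustrative)
y_i = rng.normal(0.0, 1.0, size=1000)  # block i from node Y (illustrative)

n = len(x_i)
# D_n^(i) = sqrt(n/2) * sup_u |F_n^(i)(u) - G_n^(i)(u)|
D_i = np.sqrt(n / 2.0) * ks_2samp(x_i, y_i).statistic

K_q = kstwobign.ppf(0.95)              # (1 - alpha) Kolmogorov quantile
reject_block = D_i > K_q               # block-wise decision at alpha = 0.05
```

In practice each block would be processed locally on its node, so only the scalar statistic or decision leaves the machine.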

The asymptotic analysis of the distribution of a two-sample Smirnov statistic and the rate of convergence in the distribution of a two-sided Smirnov statistic were studied extensively in [5,8,12]. We have the following proposition.

Proposition 2.1

As n → ∞, there holds

sup_z |P(D_n^{(i)} ≤ z | H_0) − K(z)| = O(n^{−1/2}),

where i = 1, 2, …, m.

Next, we will, respectively, aggregate the test decision results and p-values of hypothesis testing in each data block to obtain a final decision.

  • Scheme I

Assume that h_n^{(i)} is the ith test decision result at significance level α under the null hypothesis H_0, i = 1, 2, …, m, where h_n^{(i)} is defined as

h_n^{(i)} = 1 if H_0 is rejected, and h_n^{(i)} = 0 if H_0 is accepted. (2)

Furthermore, we have that

h_n^{(1)}|H_0, h_n^{(2)}|H_0, …, h_n^{(m)}|H_0 ~ i.i.d. Bernoulli(α_n),

where α_n = P(h_n^{(i)} = 1 | H_0). Based on h_n^{(i)}, i = 1, 2, …, m, we now turn to consider the following testing problem

H_0^a: α_n ≤ α versus H_1^a: α_n > α. (3)

To this end, we construct the test statistic

U_{m,n}^a = (∑_{i=1}^m h_n^{(i)} − mα) / √(mα(1−α)).

To conduct the testing, we need to know the limiting distribution of the constructed statistic. We establish the asymptotic distribution of the test statistic Um,na in the following theorem.

Theorem 2.2

Provided that m=o(n), under the null hypothesis H0, we have

lim_{m,n→∞} P(U_{m,n}^a ≤ t) = Φ(t), (4)

where Φ(t) is the cumulative distribution function of the standard normal distribution.

Thus, the testing scheme for (3) is

reject H_0^a, if U_{m,n}^a > Z_{1−α},

where Z_{1−α} is the (1−α)th quantile of the standard normal distribution.

This testing procedure aggregates m test decision results and thus produces a reasonable distributed testing scheme: if H0a is rejected, H0 is accordingly rejected as well, that is,

reject H_0, if U_{m,n}^a > Z_{1−α}.

For simplicity, this distributed testing scheme is called the DCR-KS test.
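The DCR-KS scheme above can be sketched end to end: partition both samples into m blocks, collect the 0/1 block decisions h_n^{(i)}, and aggregate them into U_{m,n}^a. The block count m = 100 and the data are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp, kstwobign, norm

def dcr_ks(x, y, m, alpha=0.05):
    """Return (U, reject) for the DCR-KS aggregation test."""
    K_q = kstwobign.ppf(1 - alpha)
    h = []
    for xb, yb in zip(np.array_split(x, m), np.array_split(y, m)):
        n = len(xb)
        D = np.sqrt(n / 2.0) * ks_2samp(xb, yb).statistic
        h.append(D > K_q)                      # block decision h_n^(i)
    U = (np.sum(h) - m * alpha) / np.sqrt(m * alpha * (1 - alpha))
    return U, U > norm.ppf(1 - alpha)          # reject H0 if U > Z_{1-alpha}

rng = np.random.default_rng(1)
U, reject = dcr_ks(rng.normal(size=100_000), rng.normal(size=100_000), m=100)
```

Only the m binary decisions need to be communicated to the aggregating proxy, which is the appeal of the scheme.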

  • Scheme II

In addition to the above test decision results, we can also obtain the corresponding p-values, denoted as p_n^{(i)}, when the KS test for the ith data block is implemented. Let G be the asymptotic distribution function of p_n^{(i)} for any i; under the null hypothesis, G is the standard uniform distribution. Hence, we now turn to consider the following hypothesis testing problem

H_0^b: G = G_0 versus H_1^b: G ≠ G_0,

where G_0 is the cumulative distribution function of the standard uniform distribution.

To this end, we construct the Cramér–von Mises-type test statistic

U_{m,n}^b = m ∫_0^1 (G_{m,n}(x) − G_0(x))^2 dG_0(x),

where G_{m,n}(x) is the empirical distribution function of {p_n^{(i)}}_{i=1}^m.

It can be inferred that, under the null hypothesis H_0, when m = o(n^{1/2}) we have

U_{m,n}^b →_d W = ∫_0^1 B^2(u) du, as m, n → ∞, (5)

where B(u) is the standard Brownian bridge. In this case, the testing scheme can be given by

reject H_0, if U_{m,n}^b > C_{1−α},

where C_{1−α} is the (1−α)th quantile of the asymptotic distribution of the test statistic U_{m,n}^b. This testing scheme is called the DCP-KS test.
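A sketch of the DCP-KS scheme: collect the block-wise KS p-values and test their uniformity with SciPy's one-sample Cramér–von Mises test. Note that rejecting via the CvM p-value is a convenient stand-in for comparing U_{m,n}^b with C_{1−α}; the data and block count are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp, cramervonmises

def dcp_ks(x, y, m, alpha=0.05):
    # p-value of the two-sample KS test on each block
    pvals = [ks_2samp(xb, yb).pvalue
             for xb, yb in zip(np.array_split(x, m), np.array_split(y, m))]
    res = cramervonmises(pvals, 'uniform')  # H0: p-values ~ U(0, 1)
    return res.statistic, res.pvalue < alpha

rng = np.random.default_rng(2)
stat, reject = dcp_ks(rng.normal(size=50_000), rng.normal(size=50_000), m=25)
```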

Remark 2.1

It is worth noting that the critical values used in the two schemes are determined by the corresponding limiting null distributions, whose theoretical bases are established under the conditions m → ∞ with m = o(n) (for DCR-KS) and m → ∞ with m = o(n^{1/2}) (for DCP-KS). Therefore, m cannot be too small for DCR-KS and DCP-KS in practice.

Next, we briefly discuss the lower bound of m for DCR-KS and DCP-KS in applications. Essentially, the DCR-KS aggregation testing problem is a large-sample Bernoulli proportion hypothesis test. For this problem, the sampling distribution of the proportion can be well approximated by a normal distribution under H_0 when mα > 5 at significance level α (see [30]). Similarly, the DCP-KS aggregation problem is a goodness-of-fit hypothesis test. For this problem, when the Cramér–von Mises test is adopted, the sampling distribution of the statistic can be well approximated by the asymptotic distribution for sample sizes greater than 10 (see [7]). Hence, at a significance level of 5%, we suggest selecting m > 100 for DCR-KS and m > 10 for DCP-KS.

3. Discussion of block size m

It is a basic problem on the choice of m in the application of a distributed algorithm. The general purpose of selecting m is to balance the statistical efficiency and the computational cost – that is, it is desirable to reduce the computational cost as much as possible without losing the statistical efficiency. This problem has been paid more and more attention by researchers in recent years. Here, we also discuss some related problems on block size for the proposed testing schemes.

3.1. The effect of block size on the test performance

We first investigate the performance of the proposed DCR-KS and DCP-KS tests with different m via a simulation study. Test size and power are, respectively, calculated based on 1000 independent trials. The two-node samples are independently generated from normal distributions with X ~ N(0,1) and Y ~ N(δ,1), where δ = 0, 0.03, 0.04, 0.05. In particular, δ = 0 corresponds to a true H_0, which is used to examine the size of the test statistics. The results are summarized in Figure 1 with a sample size of N = 10^5.

Figure 1. (a) Size for the two testing schemes as a function of block size; each method is shown using a different colored symbol. (b) Power with mean increments δ = 0.03, 0.04 and 0.05 for the two testing schemes as a function of block size; each increment is shown using a different colored symbol.

Panel (a) of Figure 1 displays the sizes of DCR-KS and DCP-KS. It can be seen that the sizes of the two tests are close to both the nominal level α = 0.05 and the level of the oracle with the full sample when m ∈ [40, 200] for DCR-KS and m ∈ [10, 50] for DCP-KS. Overall, DCP-KS controls the type I error better for small m, while DCR-KS does better for large m. Panel (b) of Figure 1 displays the powers of the two tests for the three mean increments δ = 0.03, 0.04 and 0.05. Both DCR-KS and DCP-KS powers increase as δ goes from 0.03 to 0.05 but decline as m grows. It is clear that the block size affects the performance of the two tests, so an appropriate block size is desirable to achieve reasonable test size and power.

In addition, we compare the computation time of the proposed testing schemes with the conventional testing scheme when the massive data can be handled by a single machine. In a setting with a sample of size 10^8 from the null distribution N(0,1), we record the computation time of the three testing schemes on a desktop computer with a 3.20 GHz Intel Core i7-8700 CPU and 16 GB memory, as shown in Table 1. The results show that the DCR-KS and DCP-KS tests can save up to 70% of the time of the conventional testing scheme, reflecting that our schemes can greatly improve computation efficiency for massive data under a memory constraint.

Table 1.

The computation time of the three testing schemes.

Case: N = 10^8, m = 100, N(0,1) vs. N(0,1)

Method   Time
DCR-KS   63.5060 s
DCP-KS   64.3648 s
Oracle   237.2067 s

3.2. A data-driven selection criterion on block size

Another interesting problem is how to select m in the application of DCR-KS and DCP-KS testing schemes. Some data-driven methods on the choice of m have been proposed in the divide-and-conquer framework. For example, Shang and Cheng [21] found that the block size could be viewed as an alternative regularization parameter in distributed nonparametric regression. Later, Xu et al. [32] proposed a cross-validation approach for selecting the tuning parameter in distributed nonparametric learning.

Inspired by Shang and Cheng [21] and Xu et al. [32], we also propose a data-driven criterion named BS that can be used to select the block size. Specifically, the proposed BS criterion is formulated as the following optimization problem:

min_m ∑_{i=1}^m (D_n^{(i)} − D̄_n)^2 + λ log(1/m), (6)

where D_n^{(i)} is the individual test statistic, D̄_n is the average of the test statistics, and λ is a tuning parameter related to the full sample size.

The first term of (6) represents the volatility of the estimated test statistic across blocks, and the second term represents the computation cost. As m grows, the volatility between the estimators increases while the overall computational cost decreases. In general, we can select λ = log(N). Figure 2 displays the objective function of (6) versus m based on 100 independent trials under the null with N(0,1) data. It can be seen that the objective function first decreases and then increases from some point as the block size grows. Therefore, an optimal m can be obtained to balance test performance and computational cost.

Figure 2. The relationship between the objective function and m for different sample sizes.
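The BS criterion (6) can be sketched as a grid search: evaluate sum_i (D_n^{(i)} − D̄_n)^2 + λ log(1/m) over candidate block sizes and keep the minimizer, with λ = log(N) as suggested above. The candidate grid and the data are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

def bs_objective(x, y, m):
    # Block-wise KS statistics D_n^(i), i = 1, ..., m
    D = np.array([np.sqrt(len(xb) / 2.0) * ks_2samp(xb, yb).statistic
                  for xb, yb in zip(np.array_split(x, m), np.array_split(y, m))])
    lam = np.log(len(x))                       # tuning parameter lambda = log(N)
    # Volatility term plus computation-cost term
    return np.sum((D - D.mean()) ** 2) + lam * np.log(1.0 / m)

rng = np.random.default_rng(3)
x, y = rng.normal(size=100_000), rng.normal(size=100_000)
candidates = [20, 50, 100, 200, 400]
m_opt = min(candidates, key=lambda m: bs_objective(x, y, m))
```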

Next, we check the performance of the BS criterion via simulation studies. For DCR-KS, the asymptotic validity of the test holds with m < N^{1/2}. Besides, m is in general no less than 10, considering that the null distribution of the aggregated test statistic is derived asymptotically. Hence, we select m for DCR-KS in the range {m : m = N^r, r ∈ (0, 0.5)}. Similarly, we select m for DCP-KS in the range {m : m = N^r, r ∈ (0, 0.4)}. The two-node samples are independently generated with underlying distributions being Normal, Uniform and Gamma, respectively. Size and power are calculated as the proportions of rejection over 1000 independent trials at the significance level α = 5%. The results are summarized in Tables 2 and 3.

Table 2.

Estimated probability of type I error for sample size N.

           N(0,1)                     U(0,1)                    Ga(1/2,1/2)
N      DCR-KS  DCP-KS  Oracle^a   DCR-KS  DCP-KS  Oracle   DCR-KS  DCP-KS  Oracle
10^5   0.056   0.055   0.058      0.056   0.047   0.059    0.052   0.053   0.054
10^6   0.049   0.051   0.052      0.048   0.043   0.044    0.048   0.048   0.055
10^7   0.054   0.050   0.057      0.049   0.053   0.052    0.056   0.045   0.052
10^8   0.052   0.051   0.047      0.052   0.048   0.052    0.052   0.047   0.048

^a The results of Oracle correspond to those based on the full sample.

Table 3.

Power of the DCR-KS test (DCP-KS in parentheses) for different sample sizes under the normal distribution.

Distribution class 100,000 200,000 400,000 600,000 800,000 1,000,000 Oracle a
Different mean and same variance
N(0,1)vs. N(0.010,1) 0.08 (0.06) 0.12 (0.11) 0.17(0.23) 0.19 (0.35) 0.25 (0.51) 0.36 (0.66) 1
N(0,1)vs. N(0.015,1) 0.13 (0.10) 0.23 (0.31) 0.40 (0.69) 0.54 (0.90) 0.67 (0.98) 0.81 (1.00) 1
N(0,1)vs. N(0.020,1) 0.19 (0.25) 0.39 (0.66) 0.74 (0.97) 0.90 (0.99) 0.97 (1.00) 0.98 (1.00) 1
N(0,1)vs. N(0.025,1) 0.33 (0.52) 0.68 (0.92) 0.94 (1.00) 0.99 (1.00) 1.00 (1.00) 1.00 (1.00) 1
N(0,1)vs. N(0.030,1) 0.49 (0.78) 0.86 (0.99) 0.99 (1.00) 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 1
Same mean and different variance
N(0,1)vs. N(0,1.022) 0.06 (0.09) 0.12 (0.31) 0.15 (0.72) 0.20 (0.93) 0.23 (0.99) 0.30 (0.99) 1
N(0,1)vs. N(0,1.042) 0.18 (0.80) 0.37 (0.99) 0.71 (1.00) 0.90 (1.00) 0.97 (1.00) 0.99 (1.00) 1
N(0,1)vs. N(0,1.062) 0.44 (1.00) 0.88 (1.00) 0.99 (1.00) 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 1
Different mean and variance
N(0,1)vs. N(0.01,1.022) 0.12 (0.17) 0.18 (0.53) 0.32 (0.92) 0.45 (0.98) 0.60 (0.99) 0.73 (1.00) 1
N(0,1)vs. N(0.02,1.042) 0.44 (0.98) 0.81 (1.00) 0.99 (1.00) 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 1
N(0,1)vs. N(0.03,1.062) 0.92 (1.00) 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 1

^a The results of Oracle correspond to those based on the full sample.

Table 2 shows the test size for two-node samples taken from N(0,1), U(0,1) and Ga(1/2,1/2). We see that the two testing schemes control the type I error for the three distributions, and the empirical sizes are close to both the nominal level α = 0.05 and the level of the oracle with the full sample. Table 3 shows the power of the two testing schemes when the two-node samples come from F ~ N(μ_1, σ_1^2) and G ~ N(μ_2, σ_2^2) with various values of (μ_1, σ_1, μ_2, σ_2). The test power improves as the sample size increases and as the difference between F and G increases. Similar results are obtained in the Uniform and Gamma cases.

The above results not only illustrate the feasibility of the BS criterion but also reflect the effectiveness of our testing schemes. In addition, we find that the selected average block size is around N^{0.4} for DCR-KS and N^{0.3} for DCP-KS. Therefore, in practice one can also choose an empirical block size m ≈ N^{0.4} for DCR-KS or m ≈ N^{0.3} for DCP-KS instead of the BS criterion.

Remark 3.1

In practice, for some highly configured computing equipment it is preferable to divide the massive data into a small number of blocks. However, as discussed in Remark 2.1, DCR-KS is not suitable for small m, while DCP-KS can be applied when m is not too small, say greater than 10. Here, we develop another testing scheme for tiny m (m ≤ 10), called DCS-KS. The key idea of DCS-KS is that, under H_0, the sum of the m block test statistics is asymptotically distributed as the mth-order convolution of the Kolmogorov distribution for large N and tiny m. Specifically, the DCS-KS test statistic is

U_{m,n}^{DCS} = ∑_{i=1}^m D_n^{(i)},

where D_n^{(i)} is the two-sided KS test statistic for the ith data block, i = 1, 2, …, m. Then, the testing scheme is given by

reject H_0, if U_{m,n}^{DCS} > L_{1−α}^m,

where L_{1−α}^m is the (1−α)th quantile of K^{*m}(z), the mth-order convolution of the Kolmogorov distribution. The explicit expression of K^{*2}(z) for m = 2 is given in Appendix 3. For m > 2, the critical value L_{1−α}^m can be approximated via the Monte Carlo method.
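The Monte Carlo approximation of L_{1−α}^m can be sketched as follows: the null statistic is a sum of m independent Kolmogorov variables, so we simulate such sums by inverse-transform sampling on a grid (a fast approximation) and take the empirical (1−α) quantile. The grid bounds and replication count are illustrative choices.

```python
import numpy as np
from scipy.stats import kstwobign

def dcs_critical_value(m, alpha=0.05, reps=100_000, seed=0):
    rng = np.random.default_rng(seed)
    # Tabulate the Kolmogorov CDF on a grid covering essentially all its mass.
    grid = np.linspace(0.2, 3.0, 2001)
    cdf = kstwobign.cdf(grid)
    # Inverse-transform sampling: reps x m approximate Kolmogorov draws.
    u = rng.random(size=(reps, m))
    draws = np.interp(u, cdf, grid)
    # Empirical (1 - alpha) quantile of the m-fold sums.
    return np.quantile(draws.sum(axis=1), 1 - alpha)

L_2 = dcs_critical_value(m=2)   # approximate critical value for m = 2
```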

Similarly, we assess the performance of the DCS-KS testing scheme for tiny m with sample size N = 10^5. The results are summarized in Tables 4 and 5. It can be seen that DCS-KS performs well in terms of test size and power for tiny m. Specifically, the size of the DCS-KS test is close to the nominal level 0.05 in all cases and is insensitive to the choice of m. In addition, the power of the DCS-KS test increases, approaching that of the oracle with the full sample, as the distribution difference increases. Therefore, DCS-KS is an efficient testing scheme for tiny m.

Table 4.

The estimated probability of type I error of the DCS-KS test for different block sizes.

  DCS-KS
m N(0,1) U(0,1) Gamma(1/2,1/2)
2 0.052 0.049 0.050
4 0.048 0.051 0.050
5 0.058 0.050 0.053
8 0.054 0.052 0.057
10 0.052 0.054 0.056

Table 5.

Power of the DCS-KS test for different block sizes under normal distribution.

m 2 4 5 8 10 Oracle a
N(0,1)vs. N(0.01,1) 0.45 0.36 0.30 0.23 0.21 1
N(0,1)vs. N(0.02,1) 0.97 0.92 0.90 0.83 0.79 1
N(0,1)vs. N(0.03,1) 1.00 1.00 1.00 1.00 0.99 1

^a The results of Oracle correspond to those based on the full sample.

According to the above discussion, we give some suggestions for applications. (a) For highly configured computing equipment, it is recommended to divide the massive data into a small number of blocks (2 ≤ m ≤ 10), and we strongly recommend using the DCS-KS testing scheme. (b) DCR-KS and DCP-KS are useful for general computing equipment. It is worth noting that DCR-KS is not suitable for small m. If one needs to adopt the decision-aggregation strategy for small m, the likelihood ratio test for Bernoulli observations can be used, but this testing scheme is conservative in terms of test size (see Appendix 4).

In the following sections, we provide new methods based only on the DCR-KS testing scheme; related developments based on the DCP-KS testing scheme are similar.

4. Fraud detection

When the data are stored in a distributed manner, we often solve statistical problems via the divide-and-conquer technique, which relies on the assumption that the distributed node datasets come from the same underlying distribution F. However, some node datasets might contain fraudulent samples from other distributions. Therefore, to improve the efficiency of distributed estimation, we need to detect which nodes contain fraudulent data. We study this problem based on the proposed testing schemes in a distributed architecture.

4.1. Algorithm

In a distributed architecture, there are s + 1 node machines storing the s + 1 massive datasets {X, Y_1, Y_2, …, Y_s}, where the dataset X is a trusted sample from the underlying continuous distribution F. Our goal is to detect which of the remaining s nodes contain fraudulent data. This detection problem can be transformed into testing whether the remaining s datasets come from the same distribution as the trusted dataset X. To this end, we perform s hypothesis tests on the equality of two distributions via the DCR-KS testing scheme (see Algorithm 1).

Algorithm 1 (presented as a figure in the original article).

Based on the above procedure, we conclude that the datasets in C_2 have distributions different from that of the dataset X, indicating that they contain fraudulent data. Therefore, before conducting statistical inference in a distributed structure with s + 1 node machines, we should pick out these fraudulent datasets to improve the efficiency of inference.
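A hedged sketch of this detection step: test each node dataset Y_j against the trusted dataset X via the DCR-KS scheme of Section 2, and put every node whose null hypothesis is rejected into the suspect set C2. The helper, data and block count below are illustrative re-statements, not the paper's exact algorithm listing.

```python
import numpy as np
from scipy.stats import ks_2samp, kstwobign, norm

def dcr_ks_reject(x, y, m, alpha=0.05):
    # DCR-KS: aggregate block-wise KS decisions into U_{m,n}^a.
    K_q = kstwobign.ppf(1 - alpha)
    h = [np.sqrt(len(xb) / 2.0) * ks_2samp(xb, yb).statistic > K_q
         for xb, yb in zip(np.array_split(x, m), np.array_split(y, m))]
    U = (np.sum(h) - m * alpha) / np.sqrt(m * alpha * (1 - alpha))
    return U > norm.ppf(1 - alpha)

rng = np.random.default_rng(4)
X = rng.normal(0.0, 1.0, size=100_000)                  # trusted sample
nodes = {1: rng.normal(0.0, 1.0, size=100_000),         # honest node
         2: rng.normal(0.5, 1.0, size=100_000)}         # fraudulent node
C2 = [j for j, Yj in nodes.items() if dcr_ks_reject(X, Yj, m=100)]
```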

4.2. Simulation studies

In this subsection, we use simulation to assess the performance of fraud-node detection with the DCR-KS test in the distributed structure and the effect of fraud detection on distributed statistical estimation.

In this simulation, there are four nodes: the trusted node dataset X is generated from N(0,1), and the other three node datasets {Y_1, Y_2, Y_3} are generated from N(μ, σ^2) with various values of (μ, σ). We conduct s = 3 hypothesis tests via the DCR-KS approach at nominal level α = 5%. All node datasets have the same sample size N. We evaluate the performance of fraud detection in terms of classification accuracy, the proportion of tested datasets that are classified correctly. In addition, we compute the overall mean estimators based on the DCR-KS detection results, the one-shot distributed approach [34] and the sample dataset X, denoted as DCM-KS, DCM and M, respectively. Here, the one-shot distributed approach averages the individual mean estimators over all node datasets. We evaluate the effect of fraud detection on distributed statistical estimation by comparing DCM-KS, DCM and M. All simulations are conducted with 1000 replicates, and the results are listed in Table 6.

Table 6.

Classification accuracy and mean estimation performance with true node dataset X ~ N(0,1).

Y1 Y2 Y3 Accuracy DCM DCM-KS M
N=100,000
N(0,1) N(0,1) N(0,1) 94.23% −0.00004 0.00002 −0.0001
N(0,1) N(0,1) N(0.05,1) 95.37% 0.0125 0.0003  
N(0,1) N(0.05,1) N(0.05,1) 96.53% 0.0249 0.0007  
N(0.04,1) N(0.05,1) N(0.06,1) 92.67% 0.0374 0.0037  
N=200,000
N(0,1) N(0,1) N(0,1) 95.67% −0.00006 −0.00006 −0.00007
N(0,1) N(0,1) N(0.05,1) 97.07% 0.0125 0.00003  
N(0,1) N(0.05,1) N(0.05,1) 98.47% 0.0250 −0.00004  
N(0.04,1) N(0.05,1) N(0.06,1) 97.13% 0.0375 0.0016  

From the perspective of classification accuracy, it reaches more than 95% in common cases, and the classification error rate stays within 10% even in the two extreme cases. The classification accuracy improves as the sample size increases in all cases. In addition, from the overall-mean estimation results, estimation performance improves as the number of identically distributed nodes increases and deteriorates as the number of nodes with fraudulent distributions increases. It is also seen that DCM-KS outperforms DCM when the distributions differ, which underscores the importance of fraud-node detection in distributed statistical inference. These results reveal the feasibility and efficiency of the proposed DCR-KS test for fraud detection.

5. The distribution-based classification for multi-nodes

In the previous section, we studied fraud detection in a distributed architecture, which is equivalent to a binary classification problem using the prior information of some trusted data. In this section, we discuss the distribution-based node classification problem in a distributed architecture, that is, grouping node datasets with the same distribution into one category. This classification approach can not only test whether the multi-node distributions differ but also classify the different distributions, which contributes to subsequent statistical analysis in a distributed architecture.

5.1. The distribution-based classification algorithm for multi-nodes

In a distributed architecture, there are s node machines storing the s massive datasets {Y_1, Y_2, …, Y_s} from underlying continuous distributions {F_1, F_2, …, F_s}, where F_i ≠ F_j may hold for some i ≠ j. Our goal is to cluster datasets with the same distribution and divide the s datasets into different categories. To this end, we propose a distribution-based classification method through the DCR-KS test; see Algorithm 2.

Algorithm 2 (presented as a figure in the original article).
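Since Algorithm 2 is given only as a figure, the sketch below is one plausible realisation under that caveat: pairwise DCR-KS tests with greedy grouping, where a node joins an existing category only if the test against that category's representative does not reject. All helper names and data are ours, not the paper's.

```python
import numpy as np
from scipy.stats import ks_2samp, kstwobign, norm

def dcr_ks_reject(x, y, m, alpha=0.05):
    # DCR-KS decision aggregation, as in Section 2.
    K_q = kstwobign.ppf(1 - alpha)
    h = [np.sqrt(len(xb) / 2.0) * ks_2samp(xb, yb).statistic > K_q
         for xb, yb in zip(np.array_split(x, m), np.array_split(y, m))]
    U = (np.sum(h) - m * alpha) / np.sqrt(m * alpha * (1 - alpha))
    return U > norm.ppf(1 - alpha)

def classify(datasets, m, alpha=0.05):
    categories = []                            # each category: list of node indices
    for j, Yj in enumerate(datasets):
        for cat in categories:
            if not dcr_ks_reject(datasets[cat[0]], Yj, m, alpha):
                cat.append(j)                  # same distribution as representative
                break
        else:
            categories.append([j])             # open a new category
    return categories

rng = np.random.default_rng(5)
data = [rng.normal(0.0, 1.0, 100_000), rng.normal(0.0, 1.0, 100_000),
        rng.normal(0.5, 1.0, 100_000)]
groups = classify(data, m=100)
```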

5.2. Simulation studies

We evaluate this classification approach with simulated data. We choose s = 4 with the same sample size N = 100,000 for each node, and discuss the following five cases.

  • Case I: Y_i ~ N(0,1), i = 1, 2, 3, 4

  • Case II: Y_i ~ N(0,1), i = 1, 2, 3; Y_4 ~ N(0.05, 1)

  • Case III: Y_i ~ N(0,1), i = 1, 2; Y_j ~ N(0.05, 1), j = 3, 4

  • Case IV: Y_i ~ N(0,1), i = 1, 2; Y_3 ~ N(0.05, 1); Y_4 ~ N(0.10, 1)

  • Case V: Y_1 ~ N(0,1); Y_2 ~ N(0.05, 1); Y_3 ~ N(0.10, 1); Y_4 ~ N(0.15, 1)

In addition, we define five classification evaluation indices, that is,

  • Correct Classification Number Proportion (CCNP): the proportion of cases where the number of estimated categories equals the number of real categories.

  • Correct Classification Proportion (CCP): the proportion of cases where the estimated categories are the same as the real categories.

  • Classification Purity (CP): the proportion of cases where all datasets in each category come from the same distribution.

  • Homogeneous Misjudgment Proportion (MMP): the proportion of cases where datasets that come from the same distribution are classified into different categories.

  • Heterogeneous Misjudgment Proportion (HMP): the proportion of cases where datasets that come from different distributions are classified into one category.

All simulations are conducted with 1000 replicates, and the results are summarized in Table 7.

Table 7.

Classification performance in different cases with significance level α=0.05.

Case Class CCNP CCP CP MMP HMP
I 1 83.80% 83.80% 100% 16.2%
II 2 86.30% 84.80% 96.70% 13.4% 3.4%
III 2 89.10% 84.70% 95.30% 15.1% 15.1%
IV 3 89.50% 87.00% 93.80% 9.3% 6.2%
V 4 92.90% 92.90% 92.90% 7.10%

Note: ‘–’ denotes null value.

It can be seen that our proposed classification is feasible in terms of the classification evaluation indices. Overall, both CCNP and CCP exceed 80%, and CP exceeds 90% in all cases. MMP and HMP show that the misjudgment proportion is below 20% in all cases; hence, our proposed classification approach is applicable within a 20% error tolerance. In addition, classification accuracy is higher when the difference between the datasets is larger, indicating that the proposed method is better suited to cases with significant differences between the node datasets.

6. Application

We apply the DCR-KS and DCP-KS testing schemes to analyze the Central England Temperature (CET) dataset, a meteorological time series consisting of 246 years (1772–2017) of average daily temperatures in Central England. The CET dataset represents the longest continuous thermometer-based temperature record on Earth and was previously analyzed by Berkes et al. [4] and Zhang et al. [33] in the context of inference for functional time series. Existing studies conclude that there are two possible change points in the CET dataset, the years 1927 and 1993. For this dataset, we are interested in testing whether the distribution of the data containing a change point is the same as that of the data before the change point, which provides new inferential evidence for the change points.

In our analysis, we treat the dataset as a univariate time series of daily average temperatures. The sample size is 246 × 365 = 89,790, where we ignore leap days. We remove seasonality by subtracting from each observation the mean temperature for that calendar day across the 246 years, and then divide this massive dataset into three groups of equal sample size, where group II and group III, respectively, contain one of the change points found in the literature. The grouped datasets are shown in Table 8. We use Gaussian kernel density estimation (KDE) and histograms to display the distributions of the three datasets. It can be seen from Figure 3 that the three distributions are slightly different.
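The preprocessing just described can be sketched as follows: subtract each calendar day's 246-year mean and split the series into three equal groups. Here `temps` is a random placeholder standing in for the actual CET series.

```python
import numpy as np

years, days = 246, 365
rng = np.random.default_rng(6)
temps = rng.normal(9.0, 5.0, size=(years, days))   # placeholder daily data

# Remove seasonality: subtract the day-of-year mean across all 246 years.
deseason = temps - temps.mean(axis=0, keepdims=True)

# Flatten chronologically; 246 * 365 = 89,790 observations in total.
series = deseason.reshape(-1)
g1, g2, g3 = np.split(series, 3)                   # 29,930 observations each
```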

Table 8.

Summary of the CET dataset partition.

Summary         Group I      Group II     Group III
Time interval   1772–1853    1854–1935    1936–2017
Sample size N   29,930       29,930       29,930

Figure 3.

The KDE curve and histogram of the three datasets.
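The preprocessing just described can be sketched in a few lines. The snippet below is a minimal illustration, assuming the daily temperatures sit in a hypothetical 246 × 365 array `temps` (filled here with synthetic seasonal data, since the CET records themselves are not bundled with the paper); it removes each calendar day's 246-year mean and splits the anomaly series into the three equal groups of Table 8.

```python
import numpy as np

# Hypothetical layout: temps[y, d] holds the average temperature on calendar
# day d of year y, for 246 years x 365 days (leap days ignored, as in the paper).
# Synthetic seasonal data stands in for the real CET records.
rng = np.random.default_rng(1)
season = 10 + 8 * np.sin(2 * np.pi * np.arange(365) / 365)
temps = season + rng.normal(0.0, 2.0, size=(246, 365))

# Remove seasonality: subtract each calendar day's 246-year mean from the observations.
anomalies = temps - temps.mean(axis=0)

# Flatten chronologically and split into three groups of 82 years (29,930 points each).
series = anomalies.ravel()
group1, group2, group3 = np.split(series, 3)
```

With 246 years divisible by 3, each group holds exactly 82 × 365 = 29,930 observations, matching Table 8.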

We apply the proposed DCR-KS and DCP-KS testing schemes to test the equality of the three group distributions. Denote by F1, F2 and F3 the distributions of groups I, II and III, respectively. We choose m = 73 and m = 41 as the block sizes of the DCR-KS and DCP-KS testing schemes, respectively. Table 9 displays the results of DCR-KS, DCP-KS and the conventional KS test with the full-sample oracle at the significance level α = 5%. The DCR-KS and DCP-KS tests both reject all three null hypotheses (p < 0.0001), which is consistent with the Oracle results. This example shows that the two-node distributed KS testing schemes are applicable.

Table 9.

Test result of distribution on CET three datasets with 5% level.

            Null hypothesis H0
Method      F1 = F2    F1 = F3    F2 = F3
DCR-KS      Rejected   Rejected   Rejected
DCP-KS      Rejected   Rejected   Rejected
Oracle^a    Rejected   Rejected   Rejected

^a The results of Oracle are based on the full-sample dataset.
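To make the block-based procedure concrete, the sketch below implements a DCR-KS-style decision rule as we understand it from the scheme's description: split both samples into m block pairs, run a two-sample KS test on each block against the asymptotic 5% Kolmogorov critical value c(0.05) ≈ 1.358, and aggregate the local reject indicators through the standardized count compared with the standard normal 95% quantile (≈ 1.6449). The function names and interface are our own illustration, not the authors' code.

```python
import numpy as np

def ks_2sample_reject(a, b, crit=1.358):
    """Two-sample KS decision: reject when sqrt(n_eff) * D exceeds the
    asymptotic 5% Kolmogorov critical value c(0.05) ~= 1.358."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    Fa = np.searchsorted(a, grid, side="right") / len(a)   # ECDF of a on pooled points
    Fb = np.searchsorted(b, grid, side="right") / len(b)   # ECDF of b on pooled points
    d = np.abs(Fa - Fb).max()
    n_eff = len(a) * len(b) / (len(a) + len(b))
    return bool(np.sqrt(n_eff) * d > crit)

def dcr_ks(x, y, m, alpha=0.05, z_crit=1.6449):
    """DCR-KS-style aggregation: m block-level decisions h_i, standardized as
    (sum h_i - m*alpha) / sqrt(m*alpha*(1-alpha)) and compared with the
    standard normal (1 - alpha) quantile."""
    xb = np.array_split(np.asarray(x, dtype=float), m)
    yb = np.array_split(np.asarray(y, dtype=float), m)
    h = np.array([ks_2sample_reject(a, b) for a, b in zip(xb, yb)], dtype=float)
    u = (h.sum() - m * alpha) / np.sqrt(m * alpha * (1 - alpha))
    return bool(u > z_crit)
```

Only the m one-bit block decisions need to be communicated to the aggregating node, which is what makes the scheme communication-efficient.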

7. Conclusion

In this paper, we develop distributed multiple-sample testing schemes for a two-node machine, called DCR-KS and DCP-KS, to test the equality of distributions in massive data. The main idea of the testing schemes is to design a proper divide-and-conquer strategy. The proposed schemes are shown to provide feasible nonparametric testing solutions for decentralized data storage architectures and can be applied in many areas. In addition, we develop fraud detection and distribution-based classification methods via the proposed testing schemes, which reveals the value of distributed testing.

In the divide-and-conquer framework, the aggregation of local results generally adopts simple averaging rules [16,35]. Some authors have also proposed weighted averaging rules for aggregation [17,22]. Weighted rules are mostly motivated by scenarios or models in which the local results do not satisfy the identically distributed assumption, so that local results must be weighted to achieve asymptotic optimality. In this paper, however, we adopt the simple averaging rule because each local result is independent and identically distributed under the null hypothesis, so that each local result plays an equally important role in aggregation.

Although this paper focuses on the KS test, the proposed distributed testing schemes can easily be extended to other two-sample tests owing to their aggregation nature. Because the computation of the Kolmogorov distribution is somewhat complicated, two-sample tests whose asymptotic null distributions are easy to compute are recommended, for example, the likelihood-ratio-based test proposed by Xing et al. [31], whose limiting null distribution is the conventional chi-square.

Acknowledgments

The authors are very grateful to the editor and reviewers for their valuable comments that led to the substantial improvement of the paper.

Appendices.

Appendix 1. Proof of Proposition 2.1

Proof.

Let

$$\Delta_n=\sup_{z>0}\bigl|P\bigl(D_n^{(i)}\le z\mid H_0\bigr)-K(z)\bigr|.$$

According to Theorem 2 in [5], for the ith data block, we have

$$\sup_{z>0}\bigl|P\bigl(D_n^{(i)}\le z\mid H_0\bigr)-K(z)\bigr|\le e^{-\pi^2\sqrt{n}}+\frac{4e}{n},$$

then as $n\to\infty$, there holds

$$\sup_{z>0}\bigl|P\bigl(D_n^{(i)}\le z\mid H_0\bigr)-K(z)\bigr|=O(1/n).$$

Appendix 2. Proof of Theorem 2.2

Proof.

According to Proposition 2.1, we have

$$\alpha_n-\alpha=O(1/n),\quad\text{as } n\to\infty.$$

Secondly, based on $h_n^{(1)},h_n^{(2)},\ldots,h_n^{(m)}$ and the Central Limit Theorem, under $H_0$ we have

$$\frac{\sum_{i=1}^m h_n^{(i)}-m\alpha_n}{\sqrt{m\alpha_n(1-\alpha_n)}}\xrightarrow{d}N(0,1),\quad m\to+\infty.$$

Let $\eta_n=\alpha_n-\alpha$, then

$$
\begin{aligned}
P\{U_{m,n}^a\le t\mid H_0\}
&=P\left\{\frac{\sum_{i=1}^m h_n^{(i)}-m\alpha}{\sqrt{m\alpha(1-\alpha)}}\le t\,\Big|\,H_0\right\}\\
&=P\left\{\frac{\sum_{i=1}^m h_n^{(i)}-m\alpha_n+m(\alpha_n-\alpha)}{\sqrt{m(\alpha_n-(\alpha_n-\alpha))(1-\alpha_n+(\alpha_n-\alpha))}}\le t\,\Big|\,H_0\right\}\\
&=P\left\{\frac{\sum_{i=1}^m h_n^{(i)}-m\alpha_n+m\eta_n}{\sqrt{m\alpha_n(1-\alpha_n)}}\le t\,\frac{\sqrt{m(\alpha_n-\eta_n)(1-\alpha_n+\eta_n)}}{\sqrt{m\alpha_n(1-\alpha_n)}}\,\Big|\,H_0\right\}\\
&=P\left\{\frac{\sum_{i=1}^m h_n^{(i)}-m\alpha_n}{\sqrt{m\alpha_n(1-\alpha_n)}}\le t\,\frac{\sqrt{m(\alpha_n-\eta_n)(1-\alpha_n+\eta_n)}}{\sqrt{m\alpha_n(1-\alpha_n)}}-\frac{m\eta_n}{\sqrt{m\alpha_n(1-\alpha_n)}}\,\Big|\,H_0\right\}\\
&\xrightarrow{m\to\infty}\Phi\left(t\,\sqrt{\frac{(\alpha_n-\eta_n)(1-\alpha_n+\eta_n)}{\alpha_n(1-\alpha_n)}}-\lim_{m\to\infty}\frac{\sqrt{m}\,\eta_n}{\sqrt{\alpha_n(1-\alpha_n)}}\right)\\
&\xrightarrow{n\to\infty}\Phi\left(t-\lim_{n,m\to\infty}\frac{\sqrt{m}\,\eta_n}{\sqrt{\alpha_n(1-\alpha_n)}}\right).
\end{aligned}
$$

With $m=o(n)$, $\lim_{n\to\infty}\alpha_n=\alpha$, and $\eta_n=O(1/n)$, we have

$$\lim_{n,m\to\infty}\frac{\sqrt{m}\,\eta_n}{\sqrt{\alpha_n(1-\alpha_n)}}=0,$$

that is,

$$\lim_{m,n\to+\infty}P\bigl(U_{m,n}^a\le t\bigr)=\Phi(t).$$
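The content of Theorem 2.2 can also be checked numerically: under $H_0$ each block decision $h_n^{(i)}$ is approximately Bernoulli($\alpha$), so the standardized count $U_{m,n}^a$ should be close to standard normal, and the aggregated test should have size close to $\alpha$. A small Monte Carlo sketch (taking $\alpha_n=\alpha$ exactly, an idealization of the proposition above):

```python
import numpy as np

alpha, m, reps = 0.05, 400, 5000
rng = np.random.default_rng(0)

# Block-level decisions h_n^(i) ~ Bernoulli(alpha) under H0 (idealizing alpha_n = alpha).
h = rng.random((reps, m)) < alpha

# Standardized count U = (sum h_i - m*alpha) / sqrt(m*alpha*(1-alpha)).
u = (h.sum(axis=1) - m * alpha) / np.sqrt(m * alpha * (1 - alpha))

# If U is approximately N(0, 1), the empirical size of the rule
# "reject when U > 1.6449" should be close to the nominal 5%.
size = float((u > 1.6449).mean())
```

The empirical size lands near the nominal level, illustrating why the normal quantile is a valid critical value once m is moderately large.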

Appendix 3. The expression of KK(z) for m = 2

Suppose that the random variables $X_1$ and $X_2$ are independent and identically distributed according to $K(x)$, where

$$K(x)=1-2\sum_{k=1}^{+\infty}(-1)^{k-1}e^{-2k^2x^2},\quad x>0.$$

The density function of the Kolmogorov distribution is

$$f(x)=K'(x)=8\sum_{k=1}^{+\infty}(-1)^{k-1}k^2x\,e^{-2k^2x^2}.$$

Next, we compute the distribution of $X_1+X_2$. For any $y>0$, there holds

$$
\begin{aligned}
P(X_1+X_2\le y)&=\iint_{x_1+x_2\le y}f(x_1)f(x_2)\,dx_1\,dx_2=\int_0^{+\infty}f(x)K(y-x)\,dx\\
&=1-16\sum_{j=1}^{+\infty}\sum_{k=1}^{+\infty}(-1)^{j+k}j^2\int_0^{+\infty}x\,e^{-2j^2x^2-2k^2(y-x)^2}\,dx\\
&=1-16\sum_{j=1}^{+\infty}\sum_{k=1}^{+\infty}(-1)^{j+k}j^2\,e^{-\frac{2k^2j^2}{k^2+j^2}y^2}\int_0^{+\infty}x\,e^{-2(k^2+j^2)\left(x-\frac{k^2}{k^2+j^2}y\right)^2}\,dx.
\end{aligned}
$$

Let

$$a=e^{-\frac{2k^2j^2}{k^2+j^2}y^2},\qquad I=a\int_0^{+\infty}x\,e^{-2(k^2+j^2)\left(x-\frac{k^2}{k^2+j^2}y\right)^2}\,dx,$$

then we have

$$
\begin{aligned}
I&=a\int_0^{+\infty}\left(x-\frac{k^2}{k^2+j^2}y\right)e^{-2(k^2+j^2)\left(x-\frac{k^2}{k^2+j^2}y\right)^2}dx
+a\int_0^{+\infty}\frac{k^2}{k^2+j^2}y\,e^{-2(k^2+j^2)\left(x-\frac{k^2}{k^2+j^2}y\right)^2}dx\\
&=\frac{a}{4(k^2+j^2)}\,e^{-2(k^2+j^2)\frac{k^4y^2}{(k^2+j^2)^2}}
+a\,\frac{k^2y}{k^2+j^2}\cdot\frac{\sqrt{2\pi}}{2\sqrt{k^2+j^2}}\,\Phi\!\left(\frac{2k^2y}{\sqrt{k^2+j^2}}\right)\\
&=\frac{1}{4(k^2+j^2)}\,e^{-2k^2y^2}+\frac{\sqrt{2\pi}\,k^2y}{2(k^2+j^2)^{3/2}}\,e^{-\frac{2k^2j^2}{k^2+j^2}y^2}\,\Phi\!\left(\frac{2k^2y}{\sqrt{k^2+j^2}}\right),
\end{aligned}
$$

where $\Phi(x)$ is the standard normal distribution function. Hence, it yields that

$$P(X_1+X_2\le y)=1-16\sum_{j=1}^{+\infty}\sum_{k=1}^{+\infty}(-1)^{j+k}\left[\frac{j^2}{4(k^2+j^2)}\,e^{-2k^2y^2}+\frac{\sqrt{2\pi}\,j^2k^2y}{2(k^2+j^2)^{3/2}}\,e^{-\frac{2k^2j^2}{k^2+j^2}y^2}\,\Phi\!\left(\frac{2k^2y}{\sqrt{k^2+j^2}}\right)\right],$$

that is,

$$KK(z)=1-16\sum_{j=1}^{+\infty}\sum_{k=1}^{+\infty}(-1)^{j+k}\left[\frac{j^2}{4(k^2+j^2)}\,e^{-2k^2z^2}+\frac{\sqrt{2\pi}\,j^2k^2z}{2(k^2+j^2)^{3/2}}\,e^{-\frac{2k^2j^2}{k^2+j^2}z^2}\,\Phi\!\left(\frac{2k^2z}{\sqrt{k^2+j^2}}\right)\right].$$
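The distribution of $X_1+X_2$ can also be evaluated by direct numerical convolution of the series for $f$ and $K$, which gives an independent cross-check of $KK(z)$. The sketch below takes this numerical route (our own choices: both series truncated at 100 terms, a trapezoidal rule, and arguments below 0.05, where the Kolmogorov tail is astronomically small, treated as zero); it is a check, not the paper's closed-form expression.

```python
import math

def K_cdf(x, terms=100):
    """Kolmogorov distribution K(x) = 1 - 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 x^2).
    Below x = 0.05 the true value is negligible, so return 0 (avoids slow series)."""
    if x <= 0.05:
        return 0.0
    s = sum((-1) ** (k - 1) * math.exp(-2 * k * k * x * x) for k in range(1, terms + 1))
    return 1.0 - 2.0 * s

def K_pdf(x, terms=100):
    """Density f(x) = 8 * sum_{k>=1} (-1)^(k-1) k^2 x exp(-2 k^2 x^2)."""
    if x <= 0.05:
        return 0.0
    return 8.0 * sum((-1) ** (k - 1) * k * k * x * math.exp(-2 * k * k * x * x)
                     for k in range(1, terms + 1))

def KK_num(y, n_grid=2000):
    """P(X1 + X2 <= y) by trapezoidal convolution: integral_0^y f(x) K(y - x) dx
    (the integrand vanishes at both endpoints)."""
    if y <= 0:
        return 0.0
    h = y / n_grid
    return h * sum(K_pdf(i * h) * K_cdf(y - i * h) for i in range(1, n_grid))
```

For instance, `KK_num` is near 0 for small arguments and approaches 1 beyond roughly twice the Kolmogorov mean (about 1.74), as the convolution of two Kolmogorov laws should.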

Appendix 4. Another aggregation scheme for test decision results

In the DCR-KS test procedure, the (1−α)th quantile of the asymptotic distribution of the test statistic $U_{m,n}^a$ is used as the critical value in the aggregation step, which implies that m should not be too small. For small m, an alternative aggregation step for the DCR-KS test, called DCR*-KS, can be adopted. Specifically, we still consider the following hypothesis testing problem in the aggregation step

$$H_0^a:\alpha_n\le\alpha\quad\text{versus}\quad H_1^a:\alpha_n>\alpha.\tag{A1}$$

The constructed test statistic is

$$U_{m,n}^{a}=\sum_{i=1}^m h_n^{(i)},$$

which is the likelihood ratio test statistic for (A1). Then, the DCR*-KS testing scheme is

$$\text{Reject } H_0\quad\text{if}\quad U_{m,n}^{a}>T_{1-\alpha}^{m},$$

where $T_{1-\alpha}^{m}$ is the $(1-\alpha)$th quantile of the Binomial distribution $B(m,\alpha)$.
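The DCR*-KS aggregation rule is straightforward to implement with exact binomial quantiles. The sketch below (function names are our own illustration) counts the local rejections and compares them with the (1−α)th quantile of B(m, α):

```python
import math

def binom_quantile(m, alpha, q):
    """Smallest integer t with P(B(m, alpha) <= t) >= q, by direct summation
    of the binomial probability mass function."""
    cdf = 0.0
    for t in range(m + 1):
        cdf += math.comb(m, t) * alpha ** t * (1 - alpha) ** (m - t)
        if cdf >= q:
            return t
    return m

def dcr_star_ks(local_rejections, alpha=0.05):
    """DCR*-KS aggregation: reject H0 when the number of local block-level
    rejections exceeds the (1 - alpha) quantile of Binomial(m, alpha)."""
    m = len(local_rejections)
    return sum(local_rejections) > binom_quantile(m, alpha, 1 - alpha)
```

Because the critical value is an exact binomial quantile rather than a normal approximation, this rule remains valid even for a small number of blocks m.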

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  1. Adams D.R., Nonparametric statistical tests in business education survey research – the Kolmogorov-Smirnov two-sample test, Delta Pi Epsilon Journal 19 (1977), pp. 32–42.
  2. Banerjee M., Durot C., and Sen B., Divide and conquer in nonstandard problems and the super-efficiency phenomenon, Ann. Stat. 47 (2019), pp. 720–757.
  3. Battey H., Fan J., Liu H., Lu J., and Zhu Z., Distributed testing and estimation under sparse high dimensional models, Ann. Stat. 46 (2018), pp. 1352–1382.
  4. Berkes I., Gabrys R., Horváth L., and Kokoszka P., Detecting changes in the mean of functional observations, J. R. Stat. Soc. Ser. B (Stat. Methodol.) 71 (2009), pp. 927–946.
  5. Byambazhav D., Estimates of the rate of convergence in the two-sided Smirnov criterion, Sib. Math. J. 28 (1987), pp. 725–731.
  6. Chen X., Liu W., and Zhang Y., First-order Newton-type estimator for distributed estimation and inference, preprint (2018). Available at arXiv:1811.11368.
  7. Csörgő S. and Faraway J.J., The exact and asymptotic distributions of Cramér-von Mises statistics, J. R. Stat. Soc. Ser. B (Methodol.) 58 (1996), pp. 221–234.
  8. Csörgő M. and Révész P., Strong Approximations in Probability and Statistics, Academic Press, New York, 1981.
  9. Fan J., Wang D., Wang K., and Zhu Z., Distributed estimation of principal eigenspaces, preprint (2017). Available at arXiv:1702.06488.
  10. Garber D., Shamir O., and Srebro N., Communication-efficient algorithms for distributed stochastic principal component analysis, preprint (2017). Available at arXiv:1702.08169.
  11. Han Y., Mukherjee P., Ozgur A., and Weissman T., Distributed statistical estimation of high-dimensional and nonparametric distributions, 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 2018, pp. 506–510.
  12. Hodges J.L., The significance probability of the Smirnov two-sample test, Ark. Mat. 3 (1958), pp. 469–486.
  13. Jordan M.I., Lee J.D., and Yang Y., Communication-efficient distributed statistical inference, preprint (2016). Available at arXiv:1605.07689.
  14. Kleiner A., Talwalkar A., Sarkar P., and Jordan M.I., A scalable bootstrap for massive data, J. R. Stat. Soc. Ser. B (Stat. Methodol.) 76 (2014), pp. 795–816.
  15. Lee J.D., Liu Q., Sun Y., and Taylor J.E., Communication-efficient sparse regression, J. Mach. Learn. Res. 18 (2017), pp. 1–30.
  16. Li R., Lin D.K.J., and Li B., Statistical inference in massive data sets, Appl. Stoch. Models Bus. Ind. 29 (2013), pp. 399–409.
  17. Lin N. and Xi R., Aggregated estimating equation estimation, Stat. Interface 4 (2011), pp. 73–83.
  18. Lin P.C., Wu B., and Watada J., Kolmogorov-Smirnov two sample test with continuous fuzzy data, in Integrated Uncertainty Management and Applications, V.N. Huynh, Y. Nakamori, J. Lawry, and M. Inuiguchi, eds., Advances in Intelligent and Soft Computing Vol. 68, Springer, Berlin, Heidelberg, 2010.
  19. Senger O., A statistical power comparison of the Kolmogorov-Smirnov two-sample test and the Wald-Wolfowitz test in terms of fixed skewness and fixed kurtosis in large sample sizes, China Econ. Rev. (English Version) 12 (2013), pp. 469–476.
  20. Shamir O., Srebro N., and Zhang T., Communication-efficient distributed optimization using an approximate Newton-type method, preprint (2014). Available at arXiv:1312.7853.
  21. Shang Z. and Cheng G., Computational limits of a distributed algorithm for smoothing spline, J. Mach. Learn. Res. 18 (2017), pp. 1–37.
  22. Shang Z., Hao B., and Cheng G., Nonparametric Bayesian aggregation for massive data, J. Mach. Learn. Res. 20 (2019), pp. 1–81.
  23. Shorack G.R. and Wellner J.A., Empirical Processes with Applications to Statistics, Wiley, New York, 1986.
  24. Smirnov N.V., Approximate laws of distribution of random variables from empirical data, Usp. Mat. Nauk 10 (1944), pp. 179–207.
  25. Szabo B. and van Zanten H., An asymptotic analysis of distributed nonparametric methods, preprint (2017). Available at arXiv:1711.03149.
  26. Volgushev S., Chao S.K., and Cheng G., Distributed inference for quantile regression processes, preprint (2017). Available at arXiv:1701.06088.
  27. Wang J., Kolar M., Srebro N., and Zhang T., Efficient distributed learning with sparsity, preprint (2016). Available at arXiv:1605.07991.
  28. Wang X. and Dunson D.B., Parallelizing MCMC via Weierstrass sampler, preprint (2013). Available at arXiv:1312.4605.
  29. Wang X., Yang Z., Chen X., and Liu W., Distributed inference for linear support vector machine, preprint (2018). Available at arXiv:1811.11922.
  30. Weiss N.A., Introductory Statistics, 9th ed., Addison-Wesley, USA, 2012.
  31. Xing X., Shang Z., Du P., Ma P., Zhong W., and Liu J.S., Minimax nonparametric two-sample test, preprint (2019). Available at arXiv:1911.02171.
  32. Xu G., Shang Z., and Cheng G., Optimal tuning for divide-and-conquer kernel ridge regression with massive data, Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80 (2018), pp. 5483–5491.
  33. Zhang X., Shao X., Hayhoe K., and Wuebbles D.J., Testing the structural stability of temporally dependent functional observations and application to climate projections, Electron. J. Stat. 5 (2011), pp. 1765–1796.
  34. Zhang Y., Duchi J.C., and Wainwright M.J., Communication-efficient algorithms for statistical optimization, J. Mach. Learn. Res. 14 (2013), pp. 3321–3363.
  35. Zhang Y., Duchi J.C., and Wainwright M.J., Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates, J. Mach. Learn. Res. 30 (2013), pp. 592–617.
  36. Zhang Y. and Xiao L., Distributed optimization for self-concordant empirical loss, preprint (2015). Available at arXiv:1501.00263.
