Published in final edited form as: J Mach Learn Res. 2020;21:94.

Minimax Nonparametric Parallelism Test

Xin Xing 1, Meimei Liu 1, Ping Ma 2, Wenxuan Zhong 2

Abstract

Testing the hypothesis of parallelism is a fundamental statistical problem arising in many applied sciences. In this paper, we develop a nonparametric parallelism test for inferring whether the trends are parallel in the treatment and control groups. In particular, the proposed nonparametric parallelism test is a Wald-type test based on a smoothing spline ANOVA (SSANOVA) model, which can characterize complex patterns in the data. We derive the asymptotic null distribution of the test statistic, unveiling a new version of the Wilks phenomenon: after standardization, the statistic is asymptotically standard normal, free of nuisance parameters. Notably, we establish the minimax sharp lower bound of the distinguishable rate for the nonparametric parallelism test using information theory, and further prove that the proposed test is minimax optimal. Simulation studies are conducted to investigate the empirical performance of the proposed test. DNA methylation and neuroimaging studies are presented to illustrate potential applications of the test. The software is available at https://github.com/BioAlgs/Parallelism.

Keywords: asymptotic distribution, minimax optimality, nonparametric inference, parallelism test, penalized least squares, smoothing spline ANOVA, Wald test

1. Introduction

The assessment of parallelism is a fundamental problem in statistical inference and arises in many applications. For example, in genomic studies, a question of primary interest is to detect genes with nonparallel expression patterns in time course studies (Storey et al., 2005; Ma et al., 2009). Another motivating example comes from epigenomics, where researchers are interested in testing whether the patterns of DNA methylation intensities along the genome in the treatment and control groups are parallel (Hansen et al., 2012). Abnormal DNA methylation patterns are associated with changes in many important biological processes such as imprinting, X-chromosome inactivation, and aging (Schübeler, 2015). In functional neuroimaging, a common problem is to detect nonparallel signals (Nichols and Holmes, 2002; Orrison et al., 2017) among different brain regions.

There is an immense literature on analyzing the parallelism of trends using linear model-based approaches, ranging from simple ANOVA (Ståhle and Wold, 1989) to linear mixed models (Vossoughi et al., 2016). However, linear model-based approaches have a limited ability to parsimoniously represent non-linear structures in complex data. Nonparametric parallelism comparison methods have drawn considerable attention due to their modeling flexibility. Munk and Dette (1998) developed a test statistic through a weighted $L_2$ distance between the regression functions under similar equally spaced fixed designs. Degras et al. (2011) tested the parallelism of multiple time series based on the $L_2$ distances between the local linear estimator of each individual curve and the global one when the time points are evenly spaced. Wang (1998) proposed a wavelet-based method to measure the changes of curves. Liu and Wang (2004) compared different nonparametric testing methods and showed that the performance of these tests depends on the shape of the true function. Ma et al. (2009) proposed an approximate F-test to detect nonparallel patterns in time course gene expression data under a more flexible random design.

However, rigorous testing methods with optimal power guarantees are still lacking in the existing nonparametric parallelism literature. The key cause of this research gap is that, in contrast to a simple, linear, or polynomial null hypothesis, the parameter space of the null hypothesis for nonparametric parallelism testing is a nonparametric function class of infinite dimension. How to conduct a rigorous test for such a composite functional null hypothesis remains an open question. A major motivation of this article is to develop a nonparametric parallelism testing approach that detects the significance of the nonparallel effect while guaranteeing statistical optimality in the sense of the minimax testing rate, facilitating power analysis.

In this article, we develop a nonparametric parallelism test based on the decomposition of a tensor product reproducing kernel Hilbert space (RKHS) (Wahba, 1990; Gu, 2013; Wang, 2011) under both fixed and random designs. The tensor product RKHS provides a flexible space for modeling complex functions; see Wahba et al. (1995), Wood (2003), and references therein. For simplicity of description, we consider the case with two predictors only. Suppose the response variable $Y_{ij}$ is the observed value of the $j$th subject at the $i$th time or spatial location for $i = 1, \ldots, n$ and $j = 1, \ldots, s$. $Y_{ij}$ depends on two predictors $x_i^{\langle1\rangle}$ and $x_j^{\langle2\rangle}$ through an unknown bivariate function $f(\cdot,\cdot) \in \mathcal{H}$, the tensor product RKHS, where $x_i^{\langle1\rangle} \in \mathcal{X}_1 = [0,1]$ is a continuous variable representing the $i$th time or spatial location, and $x_j^{\langle2\rangle} \in \mathcal{X}_2 = \{0,1\}$ is a discrete variable indicating the group of the $j$th subject: $x_j^{\langle2\rangle} = 1$ if the $j$th subject is in the treatment group and $x_j^{\langle2\rangle} = 0$ otherwise. That is,

$Y_{ij} = f(x_i^{\langle1\rangle}, x_j^{\langle2\rangle}) + \epsilon_{ij}, \quad i = 1, \ldots, n, \; j = 1, \ldots, s,$ (1)

where the $\epsilon_{ij}$'s are i.i.d. random noise following a normal distribution with mean zero and variance $\sigma^2$, and $s$ is the number of subjects. Each subject can be represented by a curve. When $s = 2$, there are two curves in total and each group has only one curve. When $s > 2$, we have multiple curves in each group. We assume i.i.d. random noise since, in many scientific experiments, the random errors are attributed to environmental factors independent of the time points or spatial locations. For example, in the fMRI data analysis in Section 6, the error is mostly attributed to random head movement and imaging noise, which are independent of time.
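To fix ideas, the following minimal Python sketch (not from the paper; the trend functions and parameters are illustrative assumptions) generates data from model (1) with one curve per group.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, sigma = 200, 2, 1.0                 # n locations, s subjects (one per group)

def f(x1, x2):
    # hypothetical smooth trends; the two curves differ by a constant shift,
    # so the nonparallel effect f11 is zero here
    return 2.5 * np.sin(3 * np.pi * x1) * (1 - x1) + 0.5 * (x2 == 1)

x1 = rng.uniform(0, 1, n)                 # random design on [0, 1]
x2 = np.array([0, 1])                     # group labels: control = 0, treatment = 1
Y = f(x1[:, None], x2[None, :]) + sigma * rng.normal(size=(n, s))   # model (1)
```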

Analogous to the classical ANOVA decomposition, $f \in \mathcal{H}$ admits the smoothing spline ANOVA (SSANOVA) decomposition (Wahba, 1990):

$f(x_i^{\langle1\rangle}, x_j^{\langle2\rangle}) = f_{00} + f_{10}(x_i^{\langle1\rangle}) + f_{01}(x_j^{\langle2\rangle}) + f_{11}(x_i^{\langle1\rangle}, x_j^{\langle2\rangle}),$ (2)

where $f_{00}$ is the grand mean, $f_{10}$ and $f_{01}$ are the main effects, and $f_{11}$ is the nonparallel effect. When $f_{11} = 0$ (see the left panel in Figure 1), the curves in the two groups are parallel: $f_{11} = 0$ is equivalent to $f(x^{\langle1\rangle}, 0)$ and $f(x^{\langle1\rangle}, 1)$ being parallel. When $f_{11} \neq 0$ (see the right panel in Figure 1), the magnitude of $\|f_{11}\|_2^2$ characterizes the significance of the non-parallelism between the treatment and control groups, where $\|f_{11}\|_2^2 = \sum_{x^{\langle2\rangle}=0}^{1}\int_0^1 f_{11}^2(x^{\langle1\rangle}, x^{\langle2\rangle})\,d\omega_1$, with $\omega_1$ the marginal density of $x^{\langle1\rangle}$. Statistically, the hypothesis testing for parallelism can be formulated as

$H_0: f_{11} = 0 \quad \text{vs.} \quad H_1: f_{11} \neq 0.$ (3)
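To see concretely why $f_{11}$ encodes parallelism, consider the difference of the two curves implied by the decomposition in Equation (2); the following short derivation is a worked special case added for illustration.

```latex
f(x^{\langle1\rangle}, 1) - f(x^{\langle1\rangle}, 0)
  = \underbrace{f_{01}(1) - f_{01}(0)}_{\text{constant in } x^{\langle1\rangle}}
  + \underbrace{f_{11}(x^{\langle1\rangle}, 1) - f_{11}(x^{\langle1\rangle}, 0)}_{\text{varies with } x^{\langle1\rangle}} .
```

The two curves are parallel exactly when their difference is constant in $x^{\langle1\rangle}$, which holds if and only if the $f_{11}$ contribution vanishes; for instance, a pure vertical shift $f(x^{\langle1\rangle}, 1) = f(x^{\langle1\rangle}, 0) + c$ gives $f_{11} = 0$.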

We introduce two concrete examples which motivate our study.

Figure 1: An illustration of two scenarios of a bivariate function $f(x^{\langle1\rangle}, x^{\langle2\rangle})$, where $x^{\langle1\rangle}$ is continuous and $x^{\langle2\rangle}$ only takes two values, 0 and 1. Left panel: the scenario with $f_{11} = 0$, i.e., $f(x^{\langle1\rangle}, 0)$ and $f(x^{\langle1\rangle}, 1)$ are parallel. Right panel: the scenario with $f_{11} \neq 0$, i.e., $f(x^{\langle1\rangle}, 0)$ and $f(x^{\langle1\rangle}, 1)$ are nonparallel.

Example 1. DNA methylation in a case-control study. DNA methylation is an essential epigenetic mechanism that regulates gene expression. Aberrant DNA methylation contributes to a number of human diseases including cancer (Stach et al., 2003). In a typical case-control study of DNA methylation (Filarsky et al., 2016), the DNA methylation level, denoted $Y_{ij}$, at the $i$th genome location $x_i^{\langle1\rangle}$ for the $j$th individual in group $x_j^{\langle2\rangle}$ can be modeled using Equation (1), where $f$ is an unknown function with the SSANOVA decomposition in Equation (2). A primary focus is to infer whether the DNA methylation levels have different profiles along the genome between the case and control groups, i.e., to test the presence or absence of the nonparallel effect $f_{11}$ as in Equation (3).

Example 2. Neuroimaging using functional magnetic resonance imaging (fMRI). fMRI is a powerful neuroimaging technology for the diagnosis of many brain-related diseases. It measures brain activity by detecting changes associated with blood flow. The primary form of fMRI uses the blood-oxygen-level dependent (BOLD) contrast as the signal (Huettel et al., 2004). In many case-control studies, the BOLD signal $Y_{ij}$ at the $i$th time $x_i^{\langle1\rangle}$ for the $j$th subject in group $x_j^{\langle2\rangle}$ is measured for a particular region of interest (ROI) and can be modeled using Equation (1), where $f$ is an unknown function with the SSANOVA decomposition in Equation (2). The goal is to test whether the BOLD signals in the two groups have the same pattern over time, i.e., to test the significance of the nonparallel effect $f_{11}$ in Equation (3).

We first establish the minimax lower bound for the nonparametric parallelism test in Equation (3) over general testing rules, with the aid of the tensor product decomposition of the RKHS and information theory. The tensor product decomposition in Equation (2) enables us to quantify the magnitude of nonparallelism by $\|f_{11}\|_2$, where $\|\cdot\|_2$ is the $L_2$ norm. Intuitively, the smaller $\|f_{11}\|_2$ is, the harder it is to distinguish the alternative hypothesis from the null. In analyzing the power performance, we consider a slightly different alternative hypothesis,

$H_1^*: \|f_{11}\|_2 \geq d_n,$ (4)

where we remove the neighborhood within distance $d_n$ of $f_{11} = 0$ from the original alternative $H_1$. The sequence $d_n$ is called the distinguishable rate (or separation rate) (Ingster and Suslina, 2012; Giné and Nickl, 2015). We first introduce a geometric interpretation of the testing problem in Equation (3), and then establish a general minimax lower bound for the distinguishable rate of the nonparametric parallelism test using the Bernstein k-width from information theory (Pinkus, 2012). The Bernstein k-width provides a geometric measure of the distinguishable rate and is easy to evaluate in the tensor product RKHS. Recently, a similar technique was used to analyze testing problems over cones in Gaussian sequence models (Wei and Wainwright, 2020).

In addition, we propose a Wald-type test statistic as the squared empirical norm of the penalized least squares estimator of $f_{11}$. We derive its asymptotic null distribution, which exhibits the Wilks phenomenon. The asymptotic distribution of our standardized test statistic is Gaussian, and the testing rule does not depend on any unknown quantities, so it is easy to compute. We can further reduce the computational cost by applying popular fast computation methods such as fast random kernel methods (Alaoui and Mahoney, 2015) and subsampling methods (Ma et al., 2015; Kim and Gu, 2004). Our proposed Wald-type test differs from existing nonparametric testing methods as follows. Existing testing procedures mostly consider a simple null hypothesis, such as the generalized likelihood ratio test in Fan et al. (2001), the penalized likelihood ratio test in Shang and Cheng (2013), the wavelet-based method in Shen et al. (2002), and the kernelized Stein method in Liu et al. (2016), whereas we consider a composite null hypothesis. More importantly, there is a nontrivial technical complication in addition to this difference in model setting. The composite null hypothesis $H_0: f_{11} = 0$ here defines a nonparametric function class in an infinite-dimensional functional space rather than a parametric family in a finite-dimensional parameter space as required in Shang and Cheng (2013), because testing $H_0: f_{11} = 0$ is equivalent to testing $H_0: f \in \{f_{00} + f_{10} + f_{01}\}$. Deriving the limiting distribution of the test statistic over an infinite-dimensional null hypothesis space and quantifying the testing difficulty are very challenging, since the distribution relies on the more delicate tensor product decomposition of the RKHS.

We further prove that the upper bound of the distinguishable rate for the proposed Wald-type test matches the established minimax lower bound. Thus the proposed Wald-type test is minimax optimal. To the best of our knowledge, our work is the first to establish a minimax nonparametric parallelism test. Based on the Wald-type test statistic, we propose a data-adaptive choice of the regularization parameter with a testing optimality guarantee.

The rest of the paper is organized as follows. We introduce the background of tensor product RKHS in Section 2. In Section 3, we introduce the minimax principle and a geometric interpretation of the parallelism testing problem, and derive the minimax lower bound of the distinguishable rate for general parallelism tests using information theory. In Section 4, we propose the Wald-type test, derive its asymptotic null distribution, and prove its minimax optimality. Section 5 presents various simulation studies demonstrating the performance of our testing method, and Section 6 applies the method to genome-wide anomalies of DNA methylation in chronic lymphocytic leukemia patients and brain function changes in patients with Alzheimer's disease. We conclude with a few remarks in Section 7. All technical proofs are relegated to the Appendix and Supplementary Material.

2. Background

In this section, we introduce some background on the tensor product RKHS and its tensor product decomposition, together with penalized least squares estimation.

2.1. Reproducing Kernel Hilbert Space

Given an RKHS $\mathcal{H}$ with inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$, there exists a symmetric and square integrable function $K(\cdot,\cdot): \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that

$\langle f, K(x,\cdot)\rangle_{\mathcal{H}} = f(x), \quad \text{for all } f \in \mathcal{H} \text{ and } x \in \mathcal{X}.$

We call $K$ the reproducing kernel of $\mathcal{H}$. By Mercer's theorem, any continuous kernel admits the decomposition

$K(x, y) = \sum_{\nu=0}^{\infty}\lambda_\nu\varphi_\nu(x)\varphi_\nu(y),$ (5)

where the $\lambda_\nu$'s are non-negative eigenvalues in descending order and the $\varphi_\nu$'s are the corresponding eigenfunctions.
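The Mercer expansion in Equation (5) can be mimicked numerically by eigendecomposing a kernel Gram matrix on the design points; the sketch below (with a hypothetical Gaussian kernel, purely for illustration) recovers the kernel from its leading eigenpairs.

```python
import numpy as np

n = 300
x = np.linspace(0, 1, n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)   # Gram matrix K(x_i, x_j)

# eigh returns ascending eigenvalues; flip to the descending order of Equation (5)
lam, phi = np.linalg.eigh(K / n)
lam, phi = lam[::-1], phi[:, ::-1]

# truncated Mercer-type reconstruction from the 20 leading eigenpairs
K_hat = n * (phi[:, :20] * lam[:20]) @ phi[:, :20].T
print(np.abs(K - K_hat).max())                       # small truncation error
```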

We consider the bivariate function $f$ in Equation (1) on the product domain $\mathcal{X}_1 \times \mathcal{X}_2$. We assume that $f$ is a function in a tensor product RKHS (Lin, 2000)

$\mathcal{H} = \mathcal{H}_1 \otimes \mathcal{H}_2.$ (6)

Given Hilbert spaces $\mathcal{H}_1$ and $\mathcal{H}_2$, $\mathcal{H}_1 \otimes \mathcal{H}_2$ is defined as the completion of the class of functions of the form $\sum_{i=1}^{M}\eta_{1i}(x)\eta_{2i}(y)$, where $\eta_{1i} \in \mathcal{H}_1$, $\eta_{2i} \in \mathcal{H}_2$, and $M$ is any positive integer. We consider $\mathcal{H}_1$ to be an $m$th order homogeneous Sobolev space, i.e.,

$\mathcal{H}_1 = \left\{\eta_1 \in L^2[0,1] : \eta_1^{(k)} \text{ is absolutely continuous and } \eta_1^{(k)}(0) = \eta_1^{(k)}(1) \text{ for } k = 0, 1, \ldots, m-1; \; \eta_1^{(m)} \in L^2[0,1]\right\},$

and $\mathcal{H}_2$ is a two-dimensional Euclidean space with the standard Euclidean norm.

Assume that $\mathcal{H}_1$ has the eigenvalue-eigenfunction pairs $\{\mu_i, \phi_i\}_{i=0}^{\infty}$ and $\mathcal{H}_2$ has the eigenvalue-eigenvector pairs $\{\nu_j, \psi_j\}_{j=1}^{2}$. Then the eigenvalue-eigenfunction pairs for the kernel function $K$ in $\mathcal{H}$ are

$\{\mu_i\nu_j, \phi_i\psi_j\} \quad \text{for } i = 0, 1, \ldots, \; j = 1, 2,$ (7)

in the decomposition in Equation (5). We refer to Equation (7) as the eigensystem of $\mathcal{H}$. We further denote by $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ the product inner product induced by the inner products on the marginal spaces $\mathcal{H}_1$ and $\mathcal{H}_2$ (Lin, 2000).

Using the Riesz representation theorem (Schölkopf et al., 2001), we can easily represent any function $f \in \mathcal{H}$ as in the following lemma.

Lemma 1 Given the sampling points $x_{ij} = (x_i^{\langle1\rangle}, x_j^{\langle2\rangle})$, $i = 1, \ldots, n$ and $j = 1, \ldots, s$, for any $f$ in a reproducing kernel Hilbert space $\mathcal{H}$, there exists a set of reproducing kernels $K_{x_{ij}}(\cdot,\cdot)$ such that

$f(x^{\langle1\rangle}, x^{\langle2\rangle}) = \sum_{i=1}^{n}\sum_{j=1}^{s}\alpha_{ij}K_{x_{ij}}(x^{\langle1\rangle}, x^{\langle2\rangle}) + \rho(x^{\langle1\rangle}, x^{\langle2\rangle}).$ (8)

Lemma 1 implies that $f$ can be expressed as the sum of a linear expansion of the $K_{x_{ij}}$'s and a nonlinear function $\rho$. Notice that when $(x^{\langle1\rangle}, x^{\langle2\rangle}) \in \{x_{ij}\}_{i=1,\ldots,n}^{j=1,\ldots,s}$, we have $\rho(x^{\langle1\rangle}, x^{\langle2\rangle}) = 0$. Thus, $\rho(\cdot,\cdot)$ can be considered as a residual that quantifies the unknown information of the function $f$. To estimate $f$, we only need to specify $K_{x_{ij}}(\cdot,\cdot)$ and estimate the $\alpha_{ij}$'s. Next, we provide a way to construct the reproducing kernels $K_{x_{ij}}(\cdot,\cdot)$. To do so, we need the following two lemmas.

Lemma 2 Suppose $K_1$ is the reproducing kernel of $\mathcal{H}_1$ on $\mathcal{X}_1$, and $K_2$ is the reproducing kernel of $\mathcal{H}_2$ on $\mathcal{X}_2$. Then the reproducing kernel of $\mathcal{H}_1 \otimes \mathcal{H}_2$ on $\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2$ is $K(x, z) = K_1(x^{\langle1\rangle}, z^{\langle1\rangle})K_2(x^{\langle2\rangle}, z^{\langle2\rangle})$ with $x = (x^{\langle1\rangle}, x^{\langle2\rangle})$ and $z = (z^{\langle1\rangle}, z^{\langle2\rangle})$.

Lemma 3 For every Sobolev space $\mathcal{H}$ of functions on $\mathcal{X}$, there corresponds a unique reproducing kernel $K$, which is non-negative definite. If $K_0$ and $K_1$ are the non-negative definite reproducing kernels of $\mathcal{H}_0$ and $\mathcal{H}_1$, respectively, and $\mathcal{H}_0 \cap \mathcal{H}_1 = \{0\}$, then $\mathcal{H}_0 \oplus \mathcal{H}_1$ has the reproducing kernel $K = K_0 + K_1$.

Lemmas 2 and 3 can be easily proved based on Theorems 2.3 to 2.6 in Gu (2013). Lemma 2 states that the reproducing kernel of a tensor product space is the product of the reproducing kernels. Lemma 3 states that the reproducing kernel of a tensor sum space is the sum of the reproducing kernels. Therefore, to construct $K_{x_{ij}}(\cdot,\cdot)$, we introduce the decomposition of the tensor product space in the following subsection.

2.2. Decomposition of Tensor Product Space

For any $\eta_1 \in \mathcal{H}_1$ and $\eta_2 \in \mathcal{H}_2$, define the averaging operators $A_1: \eta_1 \mapsto \int_0^1\eta_1(x)dx$ and $A_2: \eta_2 \mapsto \frac{1}{2}\sum_{k=1}^{2}\eta_2(k)$, where $\eta_2(k) = e_k^T\eta_2$ and $e_k$ is the unit vector with the $k$th element one and all other elements zero. Then $\mathcal{H}_1$ and $\mathcal{H}_2$ admit the tensor sum decompositions $\mathcal{H}_{01} \oplus \mathcal{H}_{11}$ and $\mathcal{H}_{02} \oplus \mathcal{H}_{12}$, respectively, where $\mathcal{H}_{01} = \{A_1\eta_1 \mid \eta_1 \in \mathcal{H}_1\}$, $\mathcal{H}_{02} = \{A_2\eta_2 \mid \eta_2 \in \mathcal{H}_2\}$, $\mathcal{H}_{11} = \{(I - A_1)\eta_1 \mid \eta_1 \in \mathcal{H}_1\}$, $\mathcal{H}_{12} = \{(I - A_2)\eta_2 \mid \eta_2 \in \mathcal{H}_2\}$, and $I$ is the identity operator. Thus $\mathcal{H}$ has the tensor sum decomposition

$\mathcal{H} = (\mathcal{H}_{01} \otimes \mathcal{H}_{02}) \oplus (\mathcal{H}_{11} \otimes \mathcal{H}_{02}) \oplus (\mathcal{H}_{01} \otimes \mathcal{H}_{12}) \oplus (\mathcal{H}_{11} \otimes \mathcal{H}_{12}),$ (9)

and for any $f \in \mathcal{H}_1 \otimes \mathcal{H}_2$, we have

$f = f_{00} + f_{10} + f_{01} + f_{11},$ (10)

where $f_{00} = A_1A_2f \in \mathcal{H}_{01} \otimes \mathcal{H}_{02}$, $f_{10} = (I - A_1)A_2f \in \mathcal{H}_{11} \otimes \mathcal{H}_{02}$, $f_{01} = A_1(I - A_2)f \in \mathcal{H}_{01} \otimes \mathcal{H}_{12}$, and $f_{11} = (I - A_1)(I - A_2)f \in \mathcal{H}_{11} \otimes \mathcal{H}_{12}$. Thus, any function $f \in \mathcal{H}$ can be decomposed uniquely as: $f_{00}$, the intercept; $f_{10}$ and $f_{01}$, the marginal effects; and $f_{11}$, the two-way interaction term.

Denote the reproducing kernels of $\mathcal{H}_{01}$, $\mathcal{H}_{02}$, $\mathcal{H}_{11}$, $\mathcal{H}_{12}$ by $K_{01}$, $K_{02}$, $K_{11}$, $K_{12}$, respectively. Specifically, $K_{01}(x^{\langle1\rangle}, z^{\langle1\rangle}) = 1$, and $K_{11}(x^{\langle1\rangle}, z^{\langle1\rangle}) = (-1)^{m-1}k_{2m}(z^{\langle1\rangle} - x^{\langle1\rangle})$ for the $m$th order homogeneous subspace, where $k_r(\cdot)$ is the $r$th order scaled Bernoulli polynomial (Abramowitz and Stegun, 1964; Gu, 2013) and $\mathbf{1}(\cdot)$ is the indicator function. On $\mathcal{X}_2$, $K_{02}(x^{\langle2\rangle}, z^{\langle2\rangle}) = 1/2$ and $K_{12}(x^{\langle2\rangle}, z^{\langle2\rangle}) = \mathbf{1}(z^{\langle2\rangle} = x^{\langle2\rangle}) - 1/2$. Let $\mathcal{H}_{\ell\ell'} = \mathcal{H}_{\ell 1} \otimes \mathcal{H}_{\ell' 2}$ with reproducing kernel $K_{\ell\ell'}$, where

$K_{\ell\ell'}(x_{ij}, x_{i'j'}) = K_{\ell 1}(x_i^{\langle1\rangle}, x_{i'}^{\langle1\rangle})K_{\ell' 2}(x_j^{\langle2\rangle}, x_{j'}^{\langle2\rangle}),$

for ,′ ∈ {0, 1}. The induced inner product of Hll is denoted as 〈fℓℓ, gℓℓℓℓ, where fℓℓ and gℓℓ are projections of f and g on Hll respectively, ,′ ∈ {0, 1}. Notice that the metrics induced by inner products 〈fℓℓ, gℓℓℓℓ are not necessarily of the same scale for different ℓℓ. The inner product for H can be defined as

$\langle f, g\rangle_{\mathcal{H}} = \sum_{\ell\ell'}\theta_{\ell\ell'}^{-1}\langle f_{\ell\ell'}, g_{\ell\ell'}\rangle_{\ell\ell'},$ (11)

where the $\theta_{\ell\ell'}$'s rescale the metrics on the different $\mathcal{H}_{\ell\ell'}$'s, and $\langle\cdot,\cdot\rangle_{\ell\ell'}$ is the restriction of $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ to $\mathcal{H}_{\ell\ell'}$.

Based on Lemmas 2 and 3, we can easily show that the reproducing kernel associated with Equation (11) is $K(x_{ij}, x_{i'j'}) = \sum_{\ell,\ell'}\theta_{\ell\ell'}K_{\ell\ell'}(x_{ij}, x_{i'j'})$ with $\ell, \ell' = 0, 1$. Thus, given the sampling points $x_{ij} = (x_i^{\langle1\rangle}, x_j^{\langle2\rangle})$ for $i = 1, \ldots, n$ and $j = 1, \ldots, s$, the kernel function in $\mathcal{H}$ is a bivariate function depending on $x_{ij}$, i.e.,

$K_{x_{ij}}(x^{\langle1\rangle}, x^{\langle2\rangle}) = \frac{\theta_{00}}{2} + \theta_{01}\left(\mathbf{1}(x^{\langle2\rangle} = x_j^{\langle2\rangle}) - \frac{1}{2}\right) + \frac{\theta_{10}}{2}K_{11}(x_i^{\langle1\rangle}, x^{\langle1\rangle}) + \frac{\theta_{11}}{2}\left(\mathbf{1}(x^{\langle2\rangle} = x_j^{\langle2\rangle}) - \frac{1}{2}\right)K_{11}(x_i^{\langle1\rangle}, x^{\langle1\rangle}),$ (12)

and accordingly $f(x^{\langle1\rangle}, x^{\langle2\rangle}) = \sum_{ij}\alpha_{ij}K_{x_{ij}}(x^{\langle1\rangle}, x^{\langle2\rangle}) + \rho(x^{\langle1\rangle}, x^{\langle2\rangle})$ by Lemma 1.

In the function decomposition in Equation (10), it is easy to verify that $f_{00} \in \mathcal{H}_{00} = \{g : g = (\theta_{00}/2)\sum_{ij}\alpha_{ij}\}$. As $f_{00}$ is a constant for any $x^{\langle1\rangle}$ and $x^{\langle2\rangle}$, it is analogous to the grand mean in classical ANOVA models. Similarly, we have $f_{01} \in \mathcal{H}_{01} = \{g : g = \theta_{01}\sum_{ij}\alpha_{ij}\mathbf{1}(x^{\langle2\rangle} = x_j^{\langle2\rangle})\}$. Recalling that $x_j^{\langle2\rangle}$ can only be 0 or 1, we can rewrite $f_{01}$ as $\mathbf{1}(x^{\langle2\rangle} = 0)\beta_0 + \mathbf{1}(x^{\langle2\rangle} = 1)\beta_1$, where $\beta_0 = \sum_{j=1}^{s}(\sum_{i=1}^{n}\alpha_{ij})\mathbf{1}(x_j^{\langle2\rangle} = 0)$ and $\beta_1 = \sum_{j=1}^{s}(\sum_{i=1}^{n}\alpha_{ij})\mathbf{1}(x_j^{\langle2\rangle} = 1)$.

We remark that $f_{00}$ and $f_{01}$ both lie in finite-dimensional spaces. The space $\mathcal{H}_{10}$ (where $f_{10}$ lies), spanned by the third term on the right-hand side of Equation (12), is however an infinite-dimensional space, because there are uncountably many $x^{\langle1\rangle} \in \mathcal{X}_1$. The function can be expressed as a linear combination of the observed reproducing kernels plus a residual that quantifies the unobserved reproducing kernels, i.e., $\mathcal{H}_{10} = \{g : g = \frac{1}{2}\sum_{i=1}^{n}(\theta_{10}\sum_{j=1}^{s}\alpha_{ij})K_{11}(x_i^{\langle1\rangle}, x^{\langle1\rangle}) + \rho_{10}\}$. Notice that a function in this space changes only with $x^{\langle1\rangle}$. Thus, the third term on the right-hand side of (12) can be used to quantify the effect of the continuous variable, such as the temporal effect. The fourth term on the right-hand side of Equation (12) varies with both the continuous variable and the case-control indicator; it is therefore the term that can capture different functional patterns between case and control. Similarly, the space spanned by the last addend is also infinite-dimensional, because there are infinitely many unobserved kernel functions in addition to the $n \times s$ observed ones. Thus, we have $f_{11} \in \mathcal{H}_{11} = \{g : g = \frac{\theta_{11}}{2}\sum_{ij}\alpha_{ij}(\mathbf{1}(x_j^{\langle2\rangle} = x^{\langle2\rangle}) - \frac{1}{2})K_{11}(x_i^{\langle1\rangle}, x^{\langle1\rangle}) + \rho_{11}\}$. Clearly, to test whether the two groups of curves are parallel, we only need to test whether $f_{11} = 0$.
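As a concrete illustration of the kernels above, the following sketch constructs the marginal kernels $K_{11}$ and $K_{12}$ for $m = 2$, using the scaled Bernoulli polynomial $k_4(x) = B_4(x)/4!$; this is a minimal rendering of the formulas in this subsection, not the packaged implementation.

```python
import numpy as np

def k4(x):
    """Scaled Bernoulli polynomial k_4(x) = B_4(x)/4! on [0, 1)."""
    return (x**4 - 2 * x**3 + x**2 - 1.0 / 30.0) / 24.0

def K11_margin(x1, z1):
    """K_11 on X_1 for m = 2: (-1)^(m-1) k_{2m}(z - x) = -k_4({z - x})."""
    return -k4(np.mod(z1 - x1, 1.0))       # {.} denotes the fractional part

def K12_margin(x2, z2):
    """K_12 on X_2 = {0, 1}: 1(z = x) - 1/2."""
    return (np.asarray(x2) == np.asarray(z2)).astype(float) - 0.5

# By Lemma 2, the reproducing kernel of the interaction space is the
# product K11_margin * K12_margin.
x = np.linspace(0, 1, 5)
print(K11_margin(x[:, None], x[None, :]))
```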

2.3. Penalized Least Squares

Here we introduce the penalized least squares estimates of $f \in \mathcal{H}$ and of the interaction term $f_{11}$ in Equation (10). Given the sampling points $x_{ij} = (x_i^{\langle1\rangle}, x_j^{\langle2\rangle})$ for $i = 1, \ldots, n$ and $j = 1, \ldots, s$, consider the model space

$\mathcal{H}_{\text{model}} = \left\{g : g = \sum_{i=1}^{n}\sum_{j=1}^{s}\alpha_{ij}K_{x_{ij}}(x^{\langle1\rangle}, x^{\langle2\rangle})\right\},$

a closed linear subspace of $\mathcal{H}$. The $\alpha_{ij}$'s are the regression coefficients, and the bivariate residual function $\rho(\cdot,\cdot)$ in Lemma 1 lies in $\mathcal{H}_{\text{residual}} = \mathcal{H} \ominus \mathcal{H}_{\text{model}}$. Notice that $\rho(x_i^{\langle1\rangle}, x_j^{\langle2\rangle}) = \langle K_{x_{ij}}, \rho\rangle_{\mathcal{H}} = 0$ because of the orthogonality between $\mathcal{H}_{\text{model}}$ and $\mathcal{H}_{\text{residual}}$. Then $f$ can be estimated by minimizing the penalized least squares functional:

$\frac{1}{ns}\sum_{i=1}^{n}\sum_{j=1}^{s}\left(Y_{ij} - \sum_{i'j'}\alpha_{i'j'}K_{x_{i'j'}}(x_i^{\langle1\rangle}, x_j^{\langle2\rangle})\right)^2 + \lambda J(f_{10} + f_{11}),$ (13)

where the quadratic functional $J(f) = J(f_{10} + f_{11}) = \|f_{10} + f_{11}\|_{\mathcal{H}}^2$ quantifies the roughness of $f_{10}$ and $f_{11}$, and the smoothing parameter $\lambda$ controls the trade-off between the goodness of fit and the roughness of $f_{10}$ and $f_{11}$. Recall that $\rho$ and the $K_{x_{ij}}(\cdot,\cdot)$'s are orthogonal to each other. Plugging Equation (8) into $J(f)$, we have

$J(f) = \left\langle\sum_{ij}\alpha_{ij}(\theta_{10}K_{x_{ij}}^{10} + \theta_{11}K_{x_{ij}}^{11}),\; \sum_{ij}\alpha_{ij}(\theta_{10}K_{x_{ij}}^{10} + \theta_{11}K_{x_{ij}}^{11})\right\rangle_{\mathcal{H}} + \langle\rho, \rho\rangle_{\mathcal{H}}.$

Further notice that $\langle K_{x_{ij}}^{\ell\ell'}, K_{x_{i'j'}}^{\ell\ell'}\rangle = K_{x_{ij}}^{\ell\ell'}(x_{i'}^{\langle1\rangle}, x_{j'}^{\langle2\rangle})$ by the reproducing property of reproducing kernels (Gu, 2013). Thus, substituting $K$ and $K_{\ell\ell'}$ by (12) and $f$ in $J(f)$ by Equation (8), Equation (13) can be rewritten as

$\|y - nsK\alpha\|_2^2 + ns\lambda\,\alpha^TQ\alpha + ns\lambda\langle\rho, \rho\rangle_{\mathcal{H}},$ (14)

where $y = (Y_{11}, Y_{21}, \ldots, Y_{ns})^T$, $K$ is the $ns \times ns$ matrix with $(i + n(j-1), i' + n(j'-1))$th entry $\frac{1}{ns}K_{x_{ij}}(x_{i'}^{\langle1\rangle}, x_{j'}^{\langle2\rangle})$, $Q$ is the $ns \times ns$ matrix with $(i + n(j-1), i' + n(j'-1))$th entry $\frac{1}{ns}(\theta_{10}K_{x_{ij}}^{10}(x_{i'}^{\langle1\rangle}, x_{j'}^{\langle2\rangle}) + \theta_{11}K_{x_{ij}}^{11}(x_{i'}^{\langle1\rangle}, x_{j'}^{\langle2\rangle}))$, and $\alpha = (\alpha_{11}, \alpha_{21}, \ldots, \alpha_{ns})^T$. Similar to Chapter A3 in Gu (2013), we set the rescaling parameters $\theta_{10}$ and $\theta_{11}$ so that $\theta_{10}K_{10}$ and $\theta_{11}K_{11}$ contribute equally to the penalty term of Equation (14) (see Appendix A.1 for details), and we set $\theta_{00}$ and $\theta_{01}$ to one since $\mathcal{H}_{00}$ and $\mathcal{H}_{01}$ are simply one-dimensional Euclidean spaces. Since $\rho$ does not depend on $\alpha$, minimizing Equation (14) over $\alpha$ is equivalent to

$\hat\alpha = \arg\min_{\alpha \in \mathbb{R}^{ns}}\|y - nsK\alpha\|_2^2 + ns\lambda\,\alpha^TQ\alpha.$ (15)

The penalized least squares estimate of $f$ is then $\hat f(x_i^{\langle1\rangle}, x_j^{\langle2\rangle}) = \sum_{i',j'}^{n,s}\hat\alpha_{i'j'}K_{x_{i'j'}}(x_i^{\langle1\rangle}, x_j^{\langle2\rangle})$.
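The quadratic problem (15) has the closed-form solution $\hat\alpha = (nsK^2 + \lambda Q)^{-1}Ky$, obtained by setting the gradient to zero and using the symmetry of $K$. A minimal sketch, assuming $K$ and $Q$ have already been assembled as defined below Equation (14):

```python
import numpy as np

def fit_alpha(y, K, Q, lam, ns):
    """Minimize ||y - ns*K a||^2 + ns*lam * a'Q a  (Equation (15)).

    The first-order condition gives (ns*K@K + lam*Q) a = K @ y.
    """
    return np.linalg.solve(ns * (K @ K) + lam * Q, K @ y)

# hypothetical usage, with K and Q precomputed (see Appendix A.1):
# alpha_hat = fit_alpha(y, K, Q, lam=1e-3, ns=n * s)
# f_hat = ns * K @ alpha_hat        # fitted values at the design points
```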

As $n$ goes to infinity, we have a countable number of kernels, and the minimizer of Equation (13) resides in an infinite-dimensional space spanned by a countable number of kernels, i.e.,

$\mathcal{H}_{\text{model}} = \left\{g : g(x^{\langle1\rangle}, x^{\langle2\rangle}) = \sum_{ij}\alpha_{ij}K_{x_{ij}}(x^{\langle1\rangle}, x^{\langle2\rangle})\right\}.$

The nonparallel effect f11 also resides in a subspace that is spanned by a countable number of kernels. We denote the subspace by

$\mathcal{H}_{11} = \left\{f_{11} : f_{11}(x^{\langle1\rangle}, x^{\langle2\rangle}) = \sum_{ij}\alpha_{ij}\frac{(-1)^{m-1}}{2}\left(\mathbf{1}(x_j^{\langle2\rangle} = x^{\langle2\rangle}) - \frac{1}{2}\right)k_{2m}(x_i^{\langle1\rangle} - x^{\langle1\rangle})\right\}.$

Here, we do not normalize $f_{11}$ by the constant scale parameter $\theta_{11}$, for simplicity of description. The penalized least squares estimate of $f_{11} \in \mathcal{H}_{11}$ is

$\hat f_{11}(x^{\langle1\rangle}, x^{\langle2\rangle}) = \sum_{i,j}^{n,s}\hat\alpha_{ij}\frac{(-1)^{m-1}}{2}\left(\mathbf{1}(x_j^{\langle2\rangle} = x^{\langle2\rangle}) - \frac{1}{2}\right)k_{2m}(x_i^{\langle1\rangle} - x^{\langle1\rangle}).$ (16)

With a slight abuse of notation, we use $\hat f_{11}$ to denote the vector of evaluations of $\hat f_{11}$ at the $ns$ data points from now on. Plugging $\hat\alpha$ into (16), we obtain an explicit expression for $\hat f_{11}$:

$\hat f_{11} = K^{11}M^{-1}\left(I_{ns} - S(S^TM^{-1}S)^{-1}S^TM^{-1}\right)y,$ (17)

where $I_{ns}$ is the $ns$-dimensional identity matrix, and $S$, $M$, and $K^{11}$ are reparametrizations of the kernel matrices with explicit forms provided in Appendix A.1 ("Notation Clarification").

In Section 4, we construct a Wald-type test statistic based on $\hat f_{11}$ for the parallelism test $H_0: f_{11} = 0$ and derive its asymptotic null distribution. Before that, we first establish the minimax principle of the parallelism test for general testing rules in Section 3.

3. Minimax Principle of the Nonparametric Parallelism Test

Consider the following testing problem:

$H_0: f_{11} = 0 \quad \text{vs.} \quad H_1: \|f_{11}\|_2 > 0.$ (18)

Given a decision rule $\phi_n$ for the testing problem (18), $\phi_n = 0$ if $H_0$ is preferred and $\phi_n = 1$ otherwise. The zero-one loss function is

$\text{Loss}(\phi_n) = \begin{cases}\phi_n & \text{if } H_0 \text{ is true},\\ 1 - \phi_n & \text{if } H_1 \text{ is true}.\end{cases}$ (19)

The minimax principle requires ϕn to minimize the maximum possible risk, i.e.,

$\min_{\phi_n}\max_{H}E[\text{Loss}(\phi_n)] = \min_{\phi_n}\left[\max_{H_0}E(\phi_n \mid H_0 \text{ is true}) + \max_{H_1}E(1 - \phi_n \mid H_1 \text{ is true})\right].$ (20)

Notice that $E(\phi_n \mid H_0 \text{ is true})$ is the probability of a type I error and $E(1 - \phi_n \mid H_1 \text{ is true})$ is the probability of a type II error. Intuitively, we choose $\phi_n$ to minimize the maximum possible type I and type II errors. Notice that if $H_0$ and $H_1$ are contiguous, we cannot ensure that Equation (20) can be controlled, because some $f_{11}$ may lie on the boundary between $H_0$ and $H_1$ that strikes a balance between acceptance and rejection of the null hypothesis, so that no appropriate decision can be made. Thus, instead of $H_1$, we consider the slightly different alternative hypothesis (4) and partition the parameter space into three sets, $H_0 + H_1^* + I$, where $I$ designates the indifference zone $0 < \|f_{11}\|_2 < d_n$. Because $d_n$ clearly separates $H_0$ from $H_1^*$, it is referred to as the distinguishable rate (a.k.a. the separation rate) (Ingster and Suslina, 2012; Giné and Nickl, 2015). Let

$\text{pseudo.risk}(\phi_n, d_n) = \sup_{H_0}E(\phi_n \mid H_0 \text{ is true}) + \sup_{H_1^*}E(1 - \phi_n \mid H_1^* \text{ is true}).$ (21)

Then pseudo.risk(ϕn, dn) converges to the risk function E[Loss(ϕn)] as dn goes to zero.

Compared with the risk function, the pseudo.risk is a function not only of the decision rule $\phi_n$ but also of the distinguishable rate $d_n$. When $\phi_n$ is given, we have $\sup_{H_1^*}E(1 - \phi_n \mid H_1^* \text{ is true}) \leq \sup_{H_1}E(1 - \phi_n \mid H_1 \text{ is true})$ because $H_1^*$ is a subset of $H_1$. Thus, finding the largest pseudo.risk on $H_1^*$ for a given $\phi_n$ is equivalent to finding the smallest $d_n$ with a tolerable pseudo.risk. In other words, finding the maximum possible pseudo.risk over the parameter space can be viewed as finding the smallest boundary of $H_1^*$ such that an appropriate decision $\phi_n$ can be made and the risk can be controlled. Meanwhile, for an adequately large $d_n$, we can always find a decision rule such that the pseudo.risk reaches its minimum value. Let $\phi_n^*(d_n) = \arg\min_{\phi_n}\text{pseudo.risk}(\phi_n, d_n)$. Then, if $d_n$ reaches its smallest value $d_n^*$, the corresponding $\phi_n^*(d_n^*)$ is the minimax decision. Thus, the essential step in finding the minimax decision is to find $d_n^*$ such that

$d_n^* = \arg\min_{d_n}\,\text{pseudo.risk}(\phi_n^*(d_n), d_n).$ (22)

Because $d_n^*$ is an estimate of the distinguishable rate that yields the minimax test, it is referred to as the minimax distinguishable rate. Clearly, the corresponding decision rule $\phi_n^* = \phi_n^*(d_n^*)$ is the minimax decision rule.

We first introduce a geometric interpretation of the testing problem (18). Geometrically, we can treat $\mathcal{E} = \{f \in \mathcal{H} : \|f\|_{\mathcal{H}} < 1/2\}$ as an ellipsoid whose axis lengths are the eigenvalues in Equation (7), as shown in Figure 2. For any $f \in \mathcal{E}$, the projection of $f$ onto $\mathcal{E}_{11} := \mathcal{H}_{11} \cap \mathcal{E}$ is $f_{11}$. The magnitude of nonparallelism can be quantified by $\|f_{11}\|_2$. The distinguishable rate $d_n$ is the radius of the sphere centered at $f_{11} = 0$ in $\mathcal{H}_{11}$.

Figure 2: Geometric interpretation of the distinguishable rate of the parallelism test.

Intuitively, the testing problem becomes harder when the projection of $f$ onto $\mathcal{E}_{11}$ is closer to the origin $f_{11} = 0$. We use the Bernstein width of Pinkus (2012) to characterize the testing difficulty. Let $S_{k+1}$ be the set of all $(k+1)$-dimensional subspaces for any $k \geq 1$. For a compact set $C$, the Bernstein $k$-width is defined as

$b_{k,2}(C) \equiv \arg\max_{r \geq 0}\left\{B_2^{k+1}(r) \subseteq C \cap S \text{ for some subspace } S \in S_{k+1}\right\},$ (23)

where $B_2^{k+1}(r)$ is a $(k+1)$-dimensional $\ell_2$-ball with radius $r$ centered at $f_{11} = 0$ in $\mathcal{E}_{11}$. The Bernstein width characterizes the largest ball that can be inscribed in a $(k+1)$-dimensional subspace of $\mathcal{E}_{11}$. Based on the Bernstein width, we give an upper bound on the testing radius: for any $f$ whose projection lies in the ball with radius less than this upper bound, the minimum pseudo.risk is larger than $1/2$.

Lemma 4 For any $f \in \mathcal{H}$, we have

$\inf_{\phi_n}\text{pseudo.risk}(\phi_n, d_n) \geq 1/2$

for all

$d_n \leq r_B \equiv \sup\left\{\delta \,\middle|\, \delta \leq \frac{\sigma}{\sqrt{2n}}(k_B(\delta))^{1/4}\right\},$

where $k_B(\delta) \equiv \arg\max_k\{b_{k-1,2}^2(\mathcal{E}_{11}) \geq \delta^2\}$ is the Bernstein lower critical dimension, and $r_B$ is called the Bernstein lower critical radius.

Lemma 4 shows that when $d_n$ is less than $r_B$, no test can distinguish the alternative hypothesis from the null. To achieve nontrivial power, we need $d_n$ to be larger than the Bernstein lower critical radius $r_B$, which is determined by the Bernstein lower critical dimension $k_B(\delta)$. In the next lemma, we provide a lower bound for $k_B(\delta)$.

Lemma 5 Let $\{\rho_i\}_{i=1}^{\infty}$ be the eigenvalues of $\mathcal{H}_{11}$. We have

$k_B(\delta) > \arg\max_k\{\rho_k \geq \delta\}.$ (24)

Plugging the lower bound on $k_B(\delta)$ derived in Lemma 5 into Lemma 4, we can calculate a lower bound for $r_B$ based on the decay rate of the eigenvalues; $r_B$ serves as a minimax lower bound for the distinguishable rate. The following theorem summarizes the minimax distinguishable rate for the testing problem (18).

Theorem 6 (Minimax lower bound for the distinguishable rate) Consider the nonparametric model (1) with the SSANOVA decomposition (2). Suppose $f \in \mathcal{H}$, where $\mathcal{H} = \mathcal{H}_1 \otimes \mathcal{H}_2$ with $\mathcal{H}_1$ the $m$th order Sobolev space and $\mathcal{H}_2$ a two-dimensional Euclidean space. The minimax distinguishable rate for testing the hypotheses (18) is achieved at $d_n \asymp n^{-2m/(4m+1)}$.

Theorem 6 provides general guidance for justifying a local minimax test: no test can distinguish the alternative from the null if $d_n \ll n^{-2m/(4m+1)}$. The proof of Theorem 6 is presented in the Appendix. Essentially, for any test $\phi_n$ with type I error $\alpha = E(\phi_n \mid H_0 \text{ is true})$ and supremum of the type II error $\delta = \sup_{H_1^*}E(1 - \phi_n \mid H_1^* \text{ is true})$, the sum $\alpha + \delta$ cannot be made small unless $d_n$ is at least of the order $n^{-2m/(4m+1)}$. For example, with $m = 2$ (cubic splines), the minimax distinguishable rate is $n^{-4/9}$. We further remark that the minimax rate for nonparametric estimation is $n^{-m/(2m+1)}$ (Yang et al., 2017), which is of larger order than the minimax distinguishable rate $n^{-2m/(4m+1)}$. In the next section, we introduce a Wald-type test for the hypothesis testing problem (18) whose separation rate $d_n$ achieves the lower bound $n^{-2m/(4m+1)}$, indicating that our proposed test is minimax optimal.

4. Wald-Type Parallelism Test

In this section, we propose a Wald-type test statistic based on the penalized least squares estimate of $f_{11}$ and derive the asymptotic distribution of the test statistic. We further prove an upper bound on the distinguishable rate of the Wald-type test that matches the minimax lower bound established in Theorem 6.

4.1. Wald-Type Test and Asymptotic Distribution

The nonparallel effect between the curves of the case group and the control group is measured by the magnitude $\|f_{11}\|_2^2$. The nonparallelism test in Equation (3) is equivalent to

$H_0: f \in \mathcal{H}_{\text{model}} \ominus \mathcal{H}_{11} \quad \text{vs.} \quad H_1: f \in \mathcal{H}_{\text{model}},$

or equivalently, $H_1: f_{11} \in \mathcal{H}_{11} \setminus \{0\}$. First, notice that the null hypothesis in Equation (18) is composite, as it defines a class of functions in $\mathcal{H}_{\text{model}} \ominus \mathcal{H}_{11}$. Second, since $H_0$ defines an infinite-dimensional parameter space as $n \to \infty$, the assumptions of the Neyman-Pearson lemma cannot be satisfied, so a uniformly most powerful test may not exist in general. To overcome this difficulty, we propose the Wald-type test statistic

$T_{n,\lambda} = \frac{1}{ns}\|\hat f_{11}\|_2^2$ (25)

and show its minimax optimality.

Since $Y_{ij}$ follows Equation (1) with $f$ satisfying the SSANOVA decomposition in Equation (2), we can replace each element of the vector $y$ by $f_{00}(x_i^{\langle1\rangle}, x_j^{\langle2\rangle}) + f_{10}(x_i^{\langle1\rangle}, x_j^{\langle2\rangle}) + f_{01}(x_i^{\langle1\rangle}, x_j^{\langle2\rangle}) + f_{11}(x_i^{\langle1\rangle}, x_j^{\langle2\rangle}) + \epsilon_{ij}$. Then, plugging the expression for $\hat f_{11}$ in Equation (17) into $T_{n,\lambda}$, we have

$T_{n,\lambda} = \frac{1}{ns}\left\|K^{11}M^{-1}\left(I_{ns} - S(S^TM^{-1}S)^{-1}S^TM^{-1}\right)(f_{00} + f_{10} + f_{01} + f_{11} + \epsilon)\right\|_2^2,$

where $f_{00}$, $f_{10}$, $f_{01}$ and $f_{11}$ are $ns$-dimensional vectors with $ij$th entries $f_{00}(x_i^{\langle1\rangle}, x_j^{\langle2\rangle})$, $f_{10}(x_i^{\langle1\rangle}, x_j^{\langle2\rangle})$, $f_{01}(x_i^{\langle1\rangle}, x_j^{\langle2\rangle})$ and $f_{11}(x_i^{\langle1\rangle}, x_j^{\langle2\rangle})$, respectively, and $\epsilon$ is the $ns$-dimensional stochastic error following a normal distribution with mean $0$ and covariance $\sigma^2I_{ns}$. Because $f_{00}$, $f_{10}$ and $f_{01}$ lie in the space orthogonal to the space spanned by $K^{11}$, and $f_{11} = 0$ under the null hypothesis, $T_{n,\lambda}$ can be further simplified as

$T_{n,\lambda} = \frac{1}{ns}\left\|K^{11}M^{-1}\left(I_{ns} - S(S^TM^{-1}S)^{-1}S^TM^{-1}\right)\epsilon\right\|_2^2.$ (26)

A detailed discussion of this simplification is provided in Lemma 12 in the Appendix.

Next, we develop the null limiting distribution of $T_{n,\lambda}$ as $n$ goes to infinity. In the derivation, we only require the number of subjects $s$ to be finite. This requirement is desirable in real applications since the number of subjects in an experiment is usually limited. For example, due to the high sequencing cost, usually only tens of samples are sequenced in DNA methylation studies.

We consider the following two designs.

Quasi-Uniform Design: $x_1^{\langle1\rangle}, x_2^{\langle1\rangle}, \ldots, x_n^{\langle1\rangle} \overset{iid}{\sim} \omega(x^{\langle1\rangle})$, where $\omega$ is the marginal density of $x^{\langle1\rangle}$. For any $x^{\langle1\rangle} \in [0, 1]$, there exist two constants $c_1, c_2 > 0$ such that $c_1 \leq \omega(x^{\langle1\rangle}) \leq c_2$ (Eggermont and LaRiccia, 2001).

Uniform Design: $x_1^{\langle1\rangle}, x_2^{\langle1\rangle}, \ldots, x_n^{\langle1\rangle}$ are evenly spaced on $[0, 1]$.

The above two designs are commonly used in scientific investigations. For example, in fMRI experiments, the sampling points in the time domain are usually measured at equal time intervals and are thus assumed to follow the uniform design. On the other hand, DNA methylation sites are randomly scattered along the DNA sequence and are therefore assumed to follow a quasi-uniform design.

Theorem 7 For both the uniform design and the quasi-uniform design, if the smoothing parameter $\lambda = O(n^{c-1})$ for any fixed $c \in (0, 1)$, we have

$\frac{T_{n,\lambda} - \mu_{n,\lambda}}{\sigma_{n,\lambda}} \overset{d}{\to} N(0, 1) \quad \text{as } n \to \infty,$

where $\mu_{n,\lambda} = \sigma^2\,\mathrm{Tr}(\Delta)/(ns)$ and $\sigma_{n,\lambda}^2 = 2\sigma^4\,\mathrm{Tr}(\Delta^2)/(ns)^2$ with $\Delta = M^{-1}(K^{11})^2M^{-1}$.

In practice, we estimate the variance $\sigma^2$ by

$\hat\sigma^2 = \frac{y^T(I - A(\lambda))^2y}{\mathrm{Tr}(I - A(\lambda))},$

where $A(\lambda) = nsK(nsK^2 + \lambda Q)^{-1}K$ is the smoothing matrix from the objective function in Equation (15), so that $(I - A(\lambda))y$ is the residual $y - \hat f$. The consistency of the variance estimate $\hat\sigma^2$ is established in Theorem 3.4 of Gu (2013).

The proof of Theorem 7 is provided in the Appendix and sketched below. Notice that $T_{n,\lambda} = T_1 + T_2 - 2T_3$, where

$T_1 = \frac{1}{ns}\epsilon^TM^{-1}(K^{11})^2M^{-1}\epsilon, \quad T_2 = \frac{1}{ns}\left\|K^{11}M^{-1}S(S^TM^{-1}S)^{-1}S^TM^{-1}\epsilon\right\|_2^2, \quad T_3 = \frac{1}{ns}\epsilon^TM^{-1}S(S^TM^{-1}S)^{-1}S^TM^{-1}(K^{11})^2M^{-1}\epsilon.$ (27)

We show that $T_2$ and $T_3$ are higher-order perturbation terms compared with $T_1$. Thus, the null distribution of $T_{n,\lambda}$ and the distribution of $T_1$ are asymptotically equivalent, and we only need to focus on the distribution of the quadratic form $T_1 = \frac{1}{ns}\epsilon^T\Delta\epsilon$ with $\epsilon$ having a mean-zero normal distribution. To prove the normality of $T_1$, we show that the log-characteristic function of the standardized $T_1$ is asymptotically $-\sigma^2t^2/2$, provided that $\mathrm{Tr}(\Delta^2)$ diverges as $\lambda \to 0$. Lemma 15 shows that $\mathrm{Tr}(\Delta^2) \asymp \hat\tau_\lambda$, where $\hat\tau_\lambda = \max\{i \mid \hat\mu_i \geq \lambda\}$ is the effective dimension (Bartlett et al., 2005; Liu et al., 2019), with $\hat\mu_1 \geq \cdots \geq \hat\mu_n$ the empirical eigenvalues of the kernel matrix of $\mathcal{H}_{11}$, whose $(i, i')$th entry is $\frac{1}{n}K_{11}(x_i^{\langle1\rangle}, x_{i'}^{\langle1\rangle})$. We further show in Lemmas 13 and 14 that $\hat\tau_\lambda$ is of the same order as its population counterpart $\tau_\lambda = \max\{i \mid \mu_i \geq \lambda\}$ under both the quasi-uniform design and the uniform design, where $\mu_1 \geq \cdots \geq 0$ is the sequence of ordered eigenvalues satisfying $K_{11}(x, x') = \sum_{i=1}^{\infty}\mu_i\phi_i(x)\phi_i(x')$. Since $\mu_i$ has a polynomial decay rate $i^{-2m}$ (Gu, 2013), we have $\mathrm{Tr}(\Delta^2) \asymp \hat\tau_\lambda \asymp \tau_\lambda \asymp \lambda^{-1/(2m)}$, which diverges as $\lambda \to 0$. Consequently, the testing consistency in Theorem 7 holds.

Theorem 7 characterizes the distribution of the test statistic $T_{n,\lambda}$ for $f \in \mathcal{H}_{\text{model}} \ominus \mathcal{H}_{11}$. The distribution turns out to be fairly simple and easy to calculate, as the test statistic does not depend on any unknown nuisance functions such as $f_{00}$, $f_{10}$ and $f_{01}$. The critical value can easily be found from the known null distribution $N(\mu_{n,\lambda}, \sigma_{n,\lambda}^2)$. Consequently, one can make a statistical decision by comparing $T_{n,\lambda}$ with the critical value. This nuisance-parameter-free property is referred to as the "Wilks phenomenon" in the statistics literature (Fan et al., 2001; Fan and Zhang, 2004).
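For concreteness, the decision rule implied by Theorem 7 can be sketched as below. The matrices $K^{11}$, $M$ and $S$ are taken as given (their explicit forms are in Appendix A.1), so this is a hedged illustration rather than the packaged implementation.

```python
import numpy as np
from scipy.stats import norm

def wald_parallelism_test(y, K11, M, S, sigma2, ns):
    """Wald-type test of H0: f11 = 0, calibrated by the null law in Theorem 7."""
    Minv = np.linalg.inv(M)
    P = np.eye(ns) - S @ np.linalg.solve(S.T @ Minv @ S, S.T @ Minv)
    f11_hat = K11 @ Minv @ P @ y                      # Equation (17)
    T = np.sum(f11_hat ** 2) / ns                     # Equation (25)

    Delta = Minv @ K11 @ K11 @ Minv                   # Delta = M^{-1}(K^11)^2 M^{-1}
    mu = sigma2 * np.trace(Delta) / ns                # mu_{n,lambda}
    sd = np.sqrt(2.0) * sigma2 * np.sqrt(np.trace(Delta @ Delta)) / ns  # sigma_{n,lambda}
    z = (T - mu) / sd
    return T, z, 2 * norm.sf(abs(z))                  # two-sided p-value
```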

4.2. Upper Bound of the Distinguishable Rate

Given a type I error level $\alpha$, we show that our Wald-type testing rule $\phi_{n,\lambda} = \mathbf{1}(|T_{n,\lambda} - \mu_{n,\lambda}| \geq z_{\alpha/2}\sigma_{n,\lambda})$ achieves the local minimax distinguishable rate. Without loss of generality, we assume $\|f\|_{\mathcal{H}} \leq 1$.

Theorem 8 Let the minimum distinguishable rate of the test $\phi_{n,\lambda}$ be $d_n(\phi_{n,\lambda})$. Suppose $\lambda = O(n^{c-1})$ for any fixed $c \in (0, 1)$. Then for any $\delta > 0$, there exist positive constants $C_\delta$ and $N_\delta$ such that, when $n \geq N_\delta$, the tolerable $\text{pseudo.risk}(\phi_{n,\lambda}, d_n) \leq \alpha + \delta$ with $d_n(\phi_{n,\lambda}) \leq C_\delta\sqrt{\lambda + \sigma_{n,\lambda}}$.

Theorem 8 shows that, for a controlled type I error, $T_{n,\lambda}$ can achieve an arbitrarily small type II error provided that the local alternative is separated from the null by at least $d_n(\phi_{n,\lambda})$. The proof of Theorem 8 is collected in the Appendix.

Note that $d_n^2(\phi_{n,\lambda})$ consists of two components: $\sigma_{n,\lambda}$, representing the standard deviation of the test statistic $T_{n,\lambda}$, and $\lambda$, representing the squared bias of $\hat f_{11}$ (see the proof of Lemma S.1 in the Supplementary Material). By approximating $\sigma_{n,\lambda}$ via the Rademacher complexity (Bartlett et al., 2005; Liu et al., 2019), we show that $\sigma_{n,\lambda} \asymp \sqrt{\tau_\lambda}/n$, which is a decreasing function of $\lambda$. Hence, the minimum distinguishable rate of $\phi_{n,\lambda}$ is achieved by trading off the bias of $\hat f_{11}$ against the standard deviation of $T_{n,\lambda}$, i.e., choosing $\lambda$ such that $\lambda \asymp \sigma_{n,\lambda}$. We next prove that our proposed Wald-type test is minimax under two special designs, the quasi-uniform design and the uniform design, in the following two corollaries.

Corollary 9 (Quasi-Uniform Design) Let $\lambda \asymp n^{-4m/(4m+1)}$ and suppose $x^{\langle1\rangle}$ follows the quasi-uniform design. We have

$P\left(d_n(\phi_{n,\lambda}) \asymp n^{-2m/(4m+1)}\right) \geq 1 - 4\exp\left(-n^{1/(2m+1)}\right).$

Corollary 10 (Uniform Design) Let $\lambda \asymp n^{-4m/(4m+1)}$ and suppose $x^{\langle1\rangle}$ follows the uniform design. We have

$d_n(\phi_{n,\lambda}) \asymp n^{-2m/(4m+1)} \quad \text{a.s.}$

Corollaries 9 and 10 suggest that if $\lambda \asymp n^{-4m/(4m+1)}$, our Wald-type test $\phi_{n,\lambda}$ achieves the minimax distinguishable rate $d_n \asymp n^{-2m/(4m+1)}$. Thus, our proposed Wald-type test is minimax optimal. We remark that Corollary 9 still holds when $\mathcal{H}_1$ is extended to a standard Sobolev space.

4.3. The Choice of Regularization Parameter

Unlike the classical bias-variance tradeoff in optimal nonparametric estimation, Theorem 8 states that optimal nonparametric testing for Equation (3) is achieved by a different tradeoff, between the squared bias of the estimator and the standard deviation of the test statistic. This intrinsic difference further leads to different orders of the optimal regularization parameter: as shown in Corollaries 9 and 10, the optimal $\lambda$ for testing is of order $n^{-4m/(4m+1)}$, whereas the optimal $\lambda$ for estimation is of order $n^{-2m/(2m+1)}$ (Gu, 2013).

In practice, cross-validation is often used as a tuning procedure for nonparametric estimation based on penalized loss functions (Golub et al., 1979). Raskutti et al. (2014) proposed another data-dependent algorithmic regularization technique, namely choosing an early stopping rule for an iterative algorithm to avoid over-fitting in nonparametric estimation. Both of these approaches are optimal for estimation but suboptimal for testing. There are few theoretically justified tuning procedures for obtaining optimal testing in nonparametric inference. One related work we are aware of is Liu and Cheng (2018), who developed a data-dependent early stopping regularization rule from an algorithmic perspective for testing $f = 0$ in the nonparametric regression model $Y = f(X) + \epsilon$. The total number of steps determined by the early stopping rule in the gradient descent algorithm plays the same role as $1/\lambda$ in penalized regularization in avoiding over-fitting. However, a data-adaptive choice of the regularization parameter $\lambda$ has been lacking for the nonparametric inference problem in Equation (3) under penalized regularization.

We propose a data-adaptive method for choosing $\lambda$ with a testing optimality guarantee based on Theorem 8. In practice, we choose the optimal smoothing parameter $\lambda^*$ satisfying

$\lambda^* = \min\{\lambda \mid \lambda < \sigma_{n,\lambda}\},$ (28)

where $\sigma_{n,\lambda}$ can be explicitly calculated from the observed data via the expression in Theorem 7, i.e., $\sigma_{n,\lambda}^2 = 2\sigma^4\,\mathrm{Tr}(\Delta^2)/(ns)^2$ with $\Delta = M^{-1}(K^{11})^2M^{-1}$.

The criterion in Equation (28) for choosing $\lambda$ is a data-dependent rule that produces a minimax-optimal nonparametric testing method. Based on the Rademacher complexity, $\sigma_{n,\lambda} \asymp \frac{\sigma^2}{ns}\sqrt{\sum_{i=1}^{n}\min\{1, \hat\mu_i/\lambda\}}$. That is, the rule in Equation (28) depends on the eigenvalues of the kernel matrix, especially the first few leading eigenvalues. There are many efficient methods for computing the top eigenvalues quickly (Drineas and Mahoney, 2005; Ma and Belkin, 2017). As future work, one could also introduce randomly projected kernel methods to accelerate the computation.
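A grid-search sketch of this rule, using the Rademacher-complexity approximation of $\sigma_{n,\lambda}$ above and returning the crossing point where $\lambda$ first matches $\sigma_{n,\lambda}$ (cf. the trade-off $\lambda \asymp \sigma_{n,\lambda}$ in Theorem 8); the eigenvalues and grid are assumed precomputed.

```python
import numpy as np

def lambda_star(mu_hat, sigma2, ns, grid):
    """Data-adaptive smoothing parameter in the spirit of rule (28).

    mu_hat : empirical eigenvalues of the kernel matrix of H_11 (descending);
    sigma_{n,lambda} is approximated by (sigma^2/ns)*sqrt(sum_i min(1, mu_i/lambda)).
    """
    for lam in np.sort(grid):                          # ascending search
        sig = sigma2 / ns * np.sqrt(np.sum(np.minimum(1.0, mu_hat / lam)))
        if lam >= sig:                                 # crossing: lambda ~ sigma_{n,lambda}
            return lam
    return grid.max()

# hypothetical usage:
# mu_hat = np.sort(np.linalg.eigvalsh(K11_matrix / n))[::-1]
# lam_opt = lambda_star(mu_hat, sigma2=1.0, ns=n * s, grid=np.logspace(-8, 0, 60))
```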

5. Simulation Study

To assess the performance of our proposed test, we carried out extensive analyses on simulated data sets. We compared our approach with the F-test (SSF) (Ma et al., 2009), the parallelism trend test (PTT) (Degras et al., 2011), and a random permutation test with 500 permutations. Among the three competing methods, the permutation test can be used as a benchmark because it closely approximates the null distribution when the number of permutations is adequate. However, the permutation test is computationally intensive, especially for calculating the Kullback-Leibler distance under the null and alternative hypotheses for the SSANOVA model (Gu, 2004).

5.1. Empirical Power Analysis

We illustrate the empirical power performance of our proposed test through four designed examples. In all four examples, we generated 100 to 1000 observations, in increments of 100, in each simulation for both the case and control groups in Equation (1), where $x_i^{\langle1\rangle} \overset{iid}{\sim} U(0, 1)$ and $\epsilon_{ij} \overset{iid}{\sim} N(0, 1)$. Each example was repeated 500 times for power and other comparisons. To make the simulations closer to reality, we considered two types of nonparallel patterns between $f(x^{\langle1\rangle}, 1)$ and $f(x^{\langle1\rangle}, 0)$: magnitude and frequency. These two kinds of nonparallel patterns are often observed in real applications. For example, hypermethylated DNA regions, i.e., regions with high methylation levels, are related to transcriptional silencing, which plays an important role in cancer development; frequency differences are often related to different brain functions between the neurological disease and control groups in fMRI studies. In the first four examples, we consider the following function in Equation (1),

$f(x^{\langle1\rangle}, x^{\langle2\rangle}) = \begin{cases}2.5\sin(3\pi x^{\langle1\rangle})(1 - x^{\langle1\rangle}) & \text{if } x^{\langle2\rangle} = 0, \text{ i.e., control},\\ (2.5 + \delta_1)\sin((3 + \delta_2)\pi x^{\langle1\rangle})(1 - x^{\langle1\rangle})^{(1 + \delta_3)} & \text{if } x^{\langle2\rangle} = 1, \text{ i.e., case},\end{cases}$ (29)

where $\delta_1$, $\delta_2$ and $\delta_3$ control the magnitude of nonparallelism between the null hypothesis and the alternative hypothesis in Equation (18). In general, varying $\delta_1$, $\delta_2$ and $\delta_3$ gives rise to different distinguishable rates $d_n$: the larger the $\delta$'s are, the larger $d_n$ is. To illustrate how the testing power is affected by the different $\delta$'s, as shown in Figure 3, we considered the following four settings. Setting 1: case and control have constant magnitude differences ($\delta_1 = 0.50, 0.75, 1.00$ and $\delta_2 = \delta_3 = 0.00$); Setting 2: case and control have frequency differences ($\delta_2 = 0.20, 0.30, 0.40$ and $\delta_1 = \delta_3 = 0.00$); Setting 3: both magnitude and frequency differ ($(\delta_1, \delta_2) = (0.50, 0.20), (0.75, 0.30), (1.00, 0.40)$ and $\delta_3 = 0.00$); Setting 4: case and control have non-constant magnitude differences ($\delta_1 = \delta_2 = 0.00$ and $\delta_3 = 0.50, 0.75, 1.00$). The corresponding functions $f(x^{\langle1\rangle}, 0)$ and $f(x^{\langle1\rangle}, 1)$ are shown in Figure 3; a data-generating sketch follows below.
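A sketch of the data-generating process of Equation (29). Note that the exponent $(1 + \delta_3)$ on $(1 - x^{\langle1\rangle})$ is our reading of the non-constant magnitude difference in Setting 4 and should be treated as an assumption.

```python
import numpy as np

def f_eq29(x1, group, d1=0.0, d2=0.0, d3=0.0):
    """Case/control trends of Equation (29); deltas control the nonparallelism."""
    if group == 0:                                        # control
        return 2.5 * np.sin(3 * np.pi * x1) * (1 - x1)
    # exponent (1 + d3) is an assumed reconstruction of the garbled source
    return (2.5 + d1) * np.sin((3 + d2) * np.pi * x1) * (1 - x1) ** (1 + d3)

rng = np.random.default_rng(1)
n = 500
x1 = rng.uniform(0, 1, n)                                 # x ~ U(0, 1), as in Section 5.1
y_ctrl = f_eq29(x1, 0) + rng.normal(size=n)               # N(0, 1) noise
y_case = f_eq29(x1, 1, d1=0.75) + rng.normal(size=n)      # Setting 1 with delta1 = 0.75
```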

Figure 3: Functions of the control group (solid line) and case group (dashed, dotted and dot-dash lines) with four types of nonparallel patterns: magnitude differences only (Setting 1), frequency differences only (Setting 2), both magnitude and frequency differences (Setting 3), and dynamic magnitude differences (Setting 4).

The empirical powers of our proposed Wald-type test, the permutation test, the SSF test and the PTT test are summarized in Tables 1-2 for Settings 1-2. For Setting 1, as shown in Table 1, the empirical power of our test increases rapidly with the sample size and approaches 1 even for the smallest magnitude ($\delta_1 = 0.50$). The empirical powers of the proposed test are comparable with those of the permutation test. In contrast, the empirical powers of SSF and PTT increase more slowly than that of our proposed test. In the weak signal scenario, i.e., $\delta_1 = 0.50$, the proposed test shows a significant gain in power across sample sizes. In the strong signal scenario, i.e., $\delta_1 = 1.00$, our proposed test is significantly more powerful than SSF and PTT when the sample size is less than 500. For Setting 2, as shown in Table 2, the empirical power of our proposed test converges to 1 as the sample size increases in all three cases $\delta_2 = 0.20, 0.30, 0.40$. In contrast, the empirical powers of SSF and PTT converge to 1 more slowly than that of the proposed test.

Table 1:

Table lists the empirical power of our proposed test, the permutation test, SSF, and PTT for Setting 1 with δ1 = 0.50, 0.75, 1.00, δ2 = δ3 = 0.00 and sample sizes ranging from 100 to 1000.

Sample Size
100 200 300 400 500 600 700 800 900 1000
δ1 = 0.50 Proposed 0.17 0.33 0.49 0.59 0.69 0.75 0.86 0.91 0.92 0.96
Permutation 0.19 0.38 0.53 0.60 0.62 0.76 0.80 0.88 0.94 0.97
SSF 0.02 0.09 0.11 0.16 0.26 0.28 0.36 0.54 0.58 0.72
PTT 0.05 0.06 0.05 0.10 0.11 0.10 0.14 0.21 0.11 0.17
δ1 = 0.75 Proposed 0.37 0.67 0.90 0.93 0.97 0.98 1.00 1.00 1.00 1.00
Permutation 0.38 0.66 0.81 0.90 0.96 0.99 0.99 1.00 1.00 1.00
SSF 0.04 0.21 0.37 0.50 0.81 0.86 0.91 0.96 0.97 0.98
PTT 0.09 0.14 0.15 0.33 0.38 0.36 0.47 0.55 0.44 0.54
δ1 = 1.00 Proposed 0.61 0.92 0.97 1.00 1.00 1.00 1.00 1.00 1.00 1.00
Permutation 0.57 0.89 0.95 0.99 1.00 1.00 1.00 1.00 1.00 1.00
SSF 0.14 0.48 0.79 0.90 0.97 0.99 1.00 1.00 1.00 1.00
PTT 0.08 0.23 0.42 0.43 0.54 0.62 0.77 0.77 0.79 0.85

Table 2:

Table lists the empirical power of our proposed test, the permutation test, SSF, and PTT for Setting 2 with δ2 = 0.20, 0.30, 0.40, δ1 = δ3 = 0.00 and sample sizes ranging from 100 to 1000.

Sample Size
100 200 300 400 500 600 700 800 900 1000
δ2 = 0.20 Proposed 0.28 0.46 0.66 0.79 0.86 0.95 0.95 0.97 0.98 0.99
Permutation 0.27 0.43 0.59 0.74 0.86 0.94 0.94 0.98 1.00 1.00
SSF 0.02 0.05 0.21 0.32 0.48 0.62 0.79 0.84 0.88 0.95
PTT 0.04 0.03 0.04 0.08 0.11 0.14 0.12 0.09 0.16 0.26
δ2 = 0.30 Proposed 0.40 0.63 0.81 0.94 0.96 0.99 0.99 1.00 1.00 1.00
Permutation 0.36 0.64 0.79 0.89 0.97 0.98 0.99 1.00 1.00 1.00
SSF 0.03 0.13 0.35 0.52 0.72 0.85 0.91 0.97 0.99 1.00
PTT 0.03 0.08 0.09 0.15 0.31 0.23 0.28 0.40 0.35 0.40
δ2 = 0.40 Proposed 0.73 0.98 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
Permutation 0.78 0.98 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
SSF 0.24 0.74 0.98 0.99 1.00 1.00 1.00 1.00 1.00 1.00
PTT 0.11 0.16 0.18 0.38 0.39 0.52 0.56 0.59 0.81 0.89

For Settings 3 and 4, we excluded the permutation test due to its extremely high computational cost and report results only for our proposed test, SSF, and PTT. As shown in Table 6, it takes more than 150 hours to complete the permutation test for one setting. For Setting 3, we simulated signals with differences in both scale and frequency between the case and control groups. The empirical powers under the different distinguishability parameters are listed in Table 3. The empirical powers of our proposed test and SSF increase in all three cases $(\delta_1, \delta_2) = (0.50, 0.20), (0.75, 0.30), (1.00, 0.40)$. The empirical power of PTT also increases, but much more slowly. When the sample size is small and the signal strength is weak, our proposed test shows a significant gain in power compared with the SSF and PTT tests. For Setting 4, there is a nonlinear magnitude difference along $x^{\langle1\rangle}$ between the two groups. As shown in Table 4, the empirical power of the SSF test converges to one more slowly than that of the proposed test and is lower than 0.65 in the least distinguishable case.

Table 6:

Table lists the computational time (in hours) of running the simulation with 500 replications for our proposed test and the permutation test.

Sample Size
100 200 300 400 500 600 700 800 900 1000
Proposed 0.01 0.03 0.04 0.06 0.07 0.09 0.10 0.12 0.14 0.16
Permutation 3.22 6.14 9.29 13.29 17.93 22.26 26.74 31.26 36.57 42.23

Table 3:

Table lists the empirical power of our proposed test, SSF, and PTT for Setting 3 with (δ1, δ2) = (0.50, 0.20), (0.75, 0.30), (1.00, 0.40), δ3 = 0.00 and sample sizes ranging from 100 to 1000.

Sample Size
100 200 300 400 500 600 700 800 900 1000
δ1 = 0.50 Proposed 0.35 0.51 0.74 0.86 0.91 0.95 0.97 0.98 1.00 1.00
δ2 = 0.20 SSF 0.04 0.15 0.29 0.41 0.57 0.72 0.85 0.89 0.91 0.96
PTT 0.03 0.07 0.07 0.08 0.08 0.06 0.15 0.19 0.21 0.20
δ1 = 0.75 Proposed 0.42 0.70 0.86 0.96 0.99 1.00 1.00 1.00 1.00 1.00
δ2 = 0.30 SSF 0.05 0.26 0.46 0.64 0.79 0.93 0.94 0.95 1.00 1.00
PTT 0.04 0.07 0.11 0.15 0.19 0.23 0.31 0.29 0.43 0.46
δ1 = 1.00 Proposed 0.72 0.98 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
δ2 = 0.40 SSF 0.25 0.72 0.97 0.99 1.00 1.00 1.00 1.00 1.00 1.00
PTT 0.11 0.19 0.22 0.32 0.52 0.50 0.64 0.61 0.73 0.69

Table 4:

Table lists the empirical power of our proposed test, SSF, and PTT for Setting 4 with δ3 = 0.50, 0.75, 1.00, δ1 = δ2 = 0.00 and sample sizes ranging from 100 to 1000.

Sample Size
100 200 300 400 500 600 700 800 900 1000
δ3 = 0.50 Proposed 0.15 0.33 0.47 0.58 0.66 0.75 0.83 0.88 0.89 0.94
SSF 0.01 0.04 0.07 0.16 0.18 0.28 0.35 0.47 0.57 0.64
PTT 0.06 0.03 0.08 0.09 0.09 0.14 0.07 0.13 0.08 0.13
δ3 = 0.75 Proposed 0.35 0.61 0.73 0.84 0.92 0.95 0.99 1.00 1.00 1.00
SSF 0.03 0.12 0.18 0.34 0.56 0.70 0.83 0.86 0.96 0.96
PTT 0.01 0.07 0.06 0.07 0.09 0.12 0.13 0.18 0.18 0.24
δ3 = 1.00 Proposed 0.42 0.70 0.85 0.95 0.99 0.99 1.00 1.00 1.00 1.00
SSF 0.07 0.20 0.52 0.76 0.82 0.92 0.98 0.98 1.00 1.00
PTT 0.09 0.04 0.08 0.10 0.18 0.18 0.21 0.24 0.25 0.28

5.2. Empirical Size Analysis

To examine the accuracy of the significance levels, we generated data from a new setting, Setting 5. We kept the functional form of the control group the same as in Equation (29) and only added a parallel shift to the control function as the function of the case group, i.e., the model does not include nonparallel patterns. In particular,

$f(x^{\langle1\rangle}, x^{\langle2\rangle}) = 2.5\sin(3\pi x^{\langle1\rangle})(1 - x^{\langle1\rangle}) + \delta_4\mathbf{1}\{x^{\langle2\rangle} = 1\},$

where $\delta_4$ was set to 0, 0.5 and 1 to characterize different levels of parallel difference between the two groups. We generated data from Equation (1) with the function $f$ specified in Setting 5. The remaining parameters were set as before.

Table 5 lists the empirical sizes of our proposed test, the permutation test, the SSF test, and PTT under Setting 5. We varied $\delta_4$ from 0.00 to 1.00 to model different magnitudes of the main effect. The empirical size of our proposed test approaches 0.05 as the sample size increases for all values of $\delta_4$. The empirical size of the SSF test fluctuates between 0.03 and 0.11. The inaccurate size of the SSF test may be attributed to the fact that its degrees of freedom are only roughly approximated by rounding the trace of the smoothing matrix. The empirical size of the PTT test fluctuates between 0.02 and 0.12.

Table 5:

Table lists the empirical sizes of the proposed test, permutation test, SSF, and PTT for δ4 = 0.00, 0.50, 1.00 and sample size ranging from 100 to 1000.

Sample Size
100 200 300 400 500 600 700 800 900 1000
δ4 = 0.00 Proposed 0.04 0.07 0.06 0.06 0.05 0.06 0.06 0.07 0.06 0.05
Permutation 0.04 0.08 0.05 0.08 0.06 0.05 0.06 0.04 0.07 0.06
SSF 0.06 0.11 0.03 0.08 0.08 0.03 0.07 0.09 0.07 0.03
PTT 0.03 0.05 0.02 0.02 0.03 0.02 0.12 0.09 0.08 0.06
δ4 = 0.50 Proposed 0.06 0.05 0.05 0.06 0.06 0.05 0.06 0.04 0.05 0.06
Permutation 0.07 0.04 0.05 0.06 0.08 0.09 0.04 0.03 0.05 0.04
SSF 0.06 0.05 0.07 0.06 0.07 0.08 0.07 0.04 0.04 0.07
PTT 0.02 0.02 0.03 0.03 0.07 0.04 0.06 0.07 0.06 0.04
δ4 = 1.00 Proposed 0.07 0.06 0.07 0.06 0.05 0.05 0.06 0.06 0.06 0.05
Permutation 0.04 0.06 0.03 0.05 0.05 0.04 0.03 0.02 0.04 0.04
SSF 0.07 0.07 0.08 0.06 0.04 0.07 0.06 0.09 0.07 0.04
PTT 0.03 0.04 0.03 0.05 0.05 0.04 0.06 0.05 0.05 0.08

5.3. Computation Time

As shown in Tables 1 and 2, our proposed test achieves power similar to that of the permutation test. Next, we compared the computation time of our proposed test and the permutation test on 500 replicated samples. We conducted the comparison on a workstation with an Intel Core i7-8700K CPU and 32 GB RAM. In Table 6, we report the computational time for Setting 1 with $\delta_1 = 0.5$ and sample sizes ranging from 100 to 1000. As shown in Table 6, our proposed test is consistently faster than the permutation test, and is nearly 263 times faster when the sample size is 1000. Note that the computational time of the permutation test exceeds 42 hours when the sample size is 1000 for running 500 tests. In practice, this huge computational cost limits the application of the permutation test in many large-scale studies involving large sample sizes and multiple tests.

5.4. Simulation Studies with Correlated Noise

We designed Setting 6 to evaluate the performance of the proposed test when the noise is correlated. In this example, we generated 100 to 1000 observations, in increments of 100, in each simulation for both the case and control groups in Equation (1). We considered $x_i^{\langle1\rangle}$, $i = 1, \ldots, n$, evenly spaced in $[0, 1]$. We generated two correlated noise vectors $(\epsilon_{11}, \ldots, \epsilon_{n1})$ and $(\epsilon_{12}, \ldots, \epsilon_{n2})$ i.i.d. from $N(0, \Sigma)$, where $\Sigma$ is autoregressive, i.e., its elements are $\Sigma_{ii'} = \rho^{|i - i'|}$ with $\rho = 0.5$. We generated the signal $Y_{ij} = f(x_i^{\langle1\rangle}, x_j^{\langle2\rangle}) + \epsilon_{ij}$, where $f$ is defined in Equation (29) with $\delta_1 = 0.00, 0.50, 0.75, 1.00$ and $\delta_2 = \delta_3 = 0.00$, that is,

$f(x^{\langle1\rangle}, x^{\langle2\rangle}) = \begin{cases}2.5\sin(3\pi x^{\langle1\rangle})(1 - x^{\langle1\rangle}) & \text{if } x^{\langle2\rangle} = 0,\\ (2.5 + \delta_1)\sin(3\pi x^{\langle1\rangle})(1 - x^{\langle1\rangle}) & \text{if } x^{\langle2\rangle} = 1.\end{cases}$

We set the significance level at 0.05 and repeated each simulation 500 times to evaluate the empirical size and power.
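The correlated noise of Setting 6 can be generated from the AR(1) covariance via a Cholesky factor; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
n, rho = 500, 0.5
idx = np.arange(n)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # Sigma_{ii'} = rho^|i-i'|
L = np.linalg.cholesky(Sigma)
eps_control = L @ rng.normal(size=n)                 # (eps_11, ..., eps_n1) ~ N(0, Sigma)
eps_case = L @ rng.normal(size=n)                    # independent copy for the case group
```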

As shown in Table 7, when $\delta_1 = 0.00$, the size of our proposed method concentrates around 0.05-0.07, while the sizes of SSF and PTT fluctuate between 0.02 and 0.16. When $\delta_1 > 0.00$, compared with SSF and PTT, our proposed method attains the highest power, which approaches 1 as $\delta_1$ increases.

Table 7:

Table lists the empirical size (δ1 = 0) and power (δ1 = 0.50, 0.75, 1.00) of our proposed test, SSF and PTT for Setting 6 with δ2 = δ3 = 0.00 and sample size ranging from 100 to 1000.

Sample Size
100 200 300 400 500 600 700 800 900 1000
δ1 = 0.00 Proposed 0.08 0.04 0.06 0.06 0.06 0.08 0.10 0.07 0.06 0.07
SSF 0.08 0.06 0.06 0.09 0.10 0.05 0.09 0.14 0.08 0.06
PTT 0.02 0.10 0.04 0.08 0.11 0.05 0.16 0.12 0.13 0.06
δ1 = 0.50 Proposed 0.21 0.33 0.48 0.57 0.73 0.73 0.82 0.91 0.94 0.96
SSF 0.01 0.05 0.10 0.17 0.29 0.32 0.46 0.48 0.63 0.72
PTT 0.13 0.22 0.35 0.50 0.51 0.53 0.72 0.73 0.78 0.86
δ1 = 0.75 Proposed 0.66 0.89 0.98 1.00 1.00 1.00 1.00 1.00 1.00 1.00
SSF 0.13 0.48 0.74 0.89 0.98 1.00 0.99 1.00 1.00 1.00
PTT 0.16 0.32 0.41 0.43 0.66 0.67 0.73 0.85 0.85 0.89
δ1 = 1.00 Proposed 0.93 0.99 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
SSF 0.47 0.93 0.99 1.00 1.00 1.00 1.00 1.00 1.00 1.00
PTT 0.16 0.41 0.55 0.62 0.69 0.85 0.86 0.90 0.90 0.95

5.5. Simulation Studies with Non-smooth Cases

We evaluate the robustness of the proposed method when the smoothness assumption fails. We designed Setting 7 to test the performance of the proposed test for cases with non-smooth trends. In this setting, we generated 100 to 1000 observations, in increments of 100, in each simulation for both the case and control groups in model (1). We considered $x_i^{\langle1\rangle}$, $i = 1, \ldots, n$, evenly spaced in $[0, 1]$ and $\epsilon_{ij} \overset{iid}{\sim} N(0, 1)$. We generated the signal $Y_{ij} = f(x_i^{\langle1\rangle}, x_j^{\langle2\rangle}) + \epsilon_{ij}$ with $f$ defined as

$f(x^{\langle1\rangle}, x^{\langle2\rangle}) = 2.5\sin(2\pi x^{\langle1\rangle})\mathbf{1}\{x^{\langle1\rangle} \in (0, 0.5)\} + (1 + \delta_5\mathbf{1}\{x^{\langle2\rangle} = 1\})\,x^{\langle1\rangle}\,\mathbf{1}\{x^{\langle1\rangle} \in [0.5, 1)\},$

which is shown in Figure 4. This curve is non-differentiable at $x^{\langle1\rangle} = 0.5$, which is a change point from a nonlinear to a linear trend. We set the significance level at 0.05 and repeated each simulation 500 times to evaluate the empirical size and power.

Figure 4: Solid line with δ5 = 0: function of the control group; dashed and dotted lines with δ5 = 1, 2: functions of the case group for Setting 7.

As shown in Table 8, when $\delta_5 = 0.00$, the empirical size of our proposed method concentrates around 0.05 and is slightly inflated compared with SSF and PTT. When $\delta_5 = 1, 2$, compared with SSF and PTT, our proposed method attains the highest power, which approaches 1 as $n$ increases.

Table 8:

Table lists the empirical size (δ5 = 0) and power (δ5 = 1.00, 2.00) of our proposed test, SSF and PTT for Setting 7.

Sample Size
100 200 300 400 500 600 700 800 900 1000
δ5 = 0.00 Proposed 0.08 0.04 0.06 0.06 0.06 0.08 0.10 0.07 0.06 0.07
SSF 0.01 0.01 0.00 0.00 0.00 0.01 0.01 0.01 0.00 0.00
PTT 0.02 0.06 0.04 0.08 0.03 0.05 0.03 0.06 0.03 0.03
δ5 = 1.00 Proposed 0.21 0.33 0.48 0.57 0.73 0.73 0.82 0.91 0.94 0.96
SSF 0.03 0.02 0.06 0.07 0.09 0.15 0.23 0.29 0.34 0.37
PTT 0.01 0.05 0.01 0.02 0.04 0.02 0.04 0.04 0.08 0.06
δ5 = 2.00 Proposed 0.66 0.89 0.98 1.00 1.00 1.00 1.00 1.00 1.00 1.00
SSF 0.07 0.24 0.46 0.63 0.76 0.87 0.91 0.98 0.97 0.99
PTT 0.02 0.04 0.02 0.09 0.03 0.06 0.07 0.10 0.06 0.06

6. Real Data Examples

We apply the proposed technique to analyze two real data sets: DNA methylation in chronic lymphocytic leukemia and neuroimaging of Alzheimer's disease using fMRI.

6.1. DNA Methylation in Chronic Lymphocytic Leukemia

Recently, Filarsky et al. (2016) reported a DNA methylation study of chronic lymphocytic leukemia (CLL) patients. In the study, DNA samples were extracted from CD19+ cells of 12 CLL patients and B cells of 6 normal subjects. DNA methylation was profiled using the whole-genome tiling array technique. The goal is to identify differentially methylated regions (DMRs), i.e., genome regions with significantly different methylation levels between CLL patients and normal subjects.

To achieve this goal, we compiled the DNA methylation intensities within −3.8 to +1.8 kb of the transcription start site (TSS) for each gene. We used the M-value suggested by Irizarry et al. (2008) as the methylation level at each site and took it as our response variable. In particular, the data consist of (Yij, xi1, xj2), where Yij is the methylation level at the ith genome location xi1 of the jth subject in group xj2, which equals 1 if the jth subject is in the case group and 0 if the jth subject is in the control group. We fit the model in Equation (1) with the SSANOVA decomposition in Equation (2) to the data.

We applied the proposed hypothesis test to 10383 regions. Controlling the FDR at 0.01 using the Benjamini–Hochberg procedure (Benjamini and Hochberg, 1995), we selected 613 DMRs. We conducted gene ontology analysis on the 613 genes corresponding to the 613 identified DMRs using GSEA (Subramanian et al., 2005). Among these genes, 79 participate in the lipid metabolic process, which plays an important role in the development of CLL (Pallasch et al., 2008); this biological process contributes to apoptosis resistance in CLL cells. Furthermore, 78 and 61 genes participate in the immune-related biological processes “Immune system process” and “Regulation of immune system process”, respectively. These observations indicate that aberrant DNA methylation potentially impacts the immune system.
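For concreteness, the following is a minimal Python sketch of the Benjamini–Hochberg step-up procedure used for the FDR control above; the function and variable names are ours, and the input pvals would hold the per-region p-values of the proposed test.

import numpy as np

def benjamini_hochberg(pvals, q=0.01):
    # Benjamini-Hochberg step-up procedure controlling FDR at level q.
    # Returns a boolean mask over the input hypotheses.
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Find the largest k with p_(k) <= (k / m) * q.
    below = ranked <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        reject[order[:k + 1]] = True  # reject the k smallest p-values
    return reject

Calling benjamini_hochberg(pvals, q=0.01) on the vector of region-level p-values returns the rejection mask that defines the selected DMRs.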

Our Wald-type test, even after FDR control, yields p-values as small as 10−9. Consequently, it is very difficult to compare our test with a permutation test based on only hundreds or thousands of permutations. Thus, we compared our proposed test with the permutation test (based on 500 permutations) only for regions with p-values larger than 0.05; the average difference between our test and the permutation test is 0.012.
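The comparison can be made concrete with a generic group-label permutation test. The sketch below assumes a user-supplied statistic stat_fn (a hypothetical placeholder, not our Tn,λ); it also shows why 500 permutations cannot resolve p-values below roughly 1/501 ≈ 0.002.

import numpy as np

def permutation_pvalue(stat_fn, Y, groups, n_perm=500, seed=0):
    # Y: (subjects x locations) matrix of M-values for one region;
    # groups: 0/1 case-control labels; stat_fn grows under nonparallelism.
    rng = np.random.default_rng(seed)
    observed = stat_fn(Y, groups)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(groups)  # shuffle case/control labels
        if stat_fn(Y, perm) >= observed:
            exceed += 1
    # Add-one correction: the smallest attainable p-value is 1/(n_perm+1),
    # so p-values near 1e-9 from the Wald-type test cannot be matched.
    return (exceed + 1) / (n_perm + 1)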

We highlight two DMRs with significant nonparallel patterns in Figure 5. Focal hypermethylation at genome locations 42574000 and 42576500 is observed in the promoter region of the gene MTA3. It was reported in Bilban et al. (2006) that the MTA3 signaling pathway is a potential biomarker for CLL and shows significantly altered gene expression. Our test identified a significant difference in the methylation levels of the MTA3 gene between CLL patients and normal subjects, which has potential prognostic value. In the promoter region of DNMT3A, we observed significant hypomethylation at genome location 25244500. DNMT3 is a family of DNA methyltransferases that can methylate hemimethylated and unmethylated CpG sites at the same rate (Okano et al., 1998). Since global hypomethylation is observed, the aberrant methylation level of this DNA methyltransferase may influence the global trend.

Figure 5: The promoter regions of two genes, (a) MTA3 and (b) DNMT3A. The horizontal axis is the genomic location and the vertical axis is the M-value representing the methylation level. The red and blue lines are the fitted curves for the case and control groups, respectively.

6.2. Neuroimaging of Alzheimer’s Disease using fMRI

Alzheimer’s disease (AD) is one of the most common neurological diseases, characterized by neurodegeneration and cognitive decline (Rombouts et al., 2005; Wang et al., 2006). Despite the prevalence of AD, no cure or preventive method is available, owing to the lack of a complete understanding of the mechanisms that contribute to AD pathophysiology. Discovering the aberrant neural networks of AD would fundamentally advance the scientific understanding of this disease.

In this study, we analyzed data collected by the Alzheimer’s Disease Neuroimaging Initiative (ADNI)2, in which the resting-state fMRI signals of 60 normal/early-mild-cognitive-impairment subjects (control group) and 50 AD/late-mild-cognitive-impairment subjects (AD group) were collected from 256×256×170 voxels at 140 consecutive time points with equal time intervals of 30ms. The fMRI signals of each subject were preprocessed using the fMRI Expert Analysis Tool (FEAT) (Smith et al., 2004) for skull-stripping, motion correction, slice timing correction, temporal filtering, spatial smoothing, and registration to standard space (MNI152 T1 2mm model), so that signals from all subjects can be treated as coming from the same standardized brain template. Sixty-nine brain regions of interest (ROIs) defined by the Harvard-Oxford Atlas (http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/Atlases) were extracted by an automatic regional labeling approach from the preprocessed fMRI data. For each ROI, we consider model (1) with the SSANOVA decomposition in Equation (2), where Yij records the average blood-oxygen-level (Huettel et al., 2004) of the brain region for subject j measured at time point xi1. As the blood-oxygen-level accurately quantifies the corresponding brain activity, we can detect abnormal AD-related brain activity. The testing problem in Equation (18) is equivalent to testing whether the brain activities of a given ROI have different temporal patterns in the case and control groups.

Seven cortical regions (parahippocampal gyrus, cingulate gyrus, inferior temporal gyrus, post-central gyrus, juxtapositional lobule cortex, precuneous cortex, and central opercular cortex) and one sub-cortical region (right thalamus) with significantly different temporal patterns were identified using our test, with the false discovery rate controlled at 5% using the Benjamini–Hochberg procedure (Benjamini and Hochberg, 1995). Among the eight ROIs, the parahippocampal gyrus and cingulate gyrus have been shown clinically to be risk factors for AD. As demonstrated in Echávarri et al. (2011) and Kesslak et al. (1991), the parahippocampal gyrus of AD patients shows significant atrophy. Meanwhile, the cingulate gyrus has also been found to be AD related (Scheff et al., 2015), owing to its extensive connectivity with multiple cortical areas, especially areas involved in learning and memory. In Figure 6, we plot frontal, axial, and lateral views and the corresponding temporal patterns of the parahippocampal gyrus and cingulate gyrus. The temporal regions with significant differences between AD/late-mild-cognitive-impairment subjects (red line) and normal/early-mild-cognitive-impairment subjects (blue line) are highlighted. As clearly demonstrated in the lower left panel of Figure 6, the first highlighted area of the parahippocampal gyrus has a significantly reversed pattern between the case and control groups. The second highlighted area shows reduced levels for the AD group. For the cingulate gyrus, the highlighted regions in the right panel of Figure 6 show a clearly larger magnitude for the AD group. This difference was also observed via fMRI in a visual encoding memory task (Rami et al., 2012). Both experiments suggest that the difference may alter memory function.

Figure 6: Blood-oxygen-levels of the parahippocampal gyrus (left) and cingulate gyrus (right) for the control group (blue) and the AD group (red), observed at 140 time points. Physical locations of the two ROIs on the frontal, axial, and lateral views are illustrated at the top of each panel.

7. Discussion

Hypothesis testing in SSANOVA is a very challenging problem. In this paper, we develop a Wald-type test for the significance of nonparallelism in a two-way SSANOVA model. The optimality of the proposed test is justified by the minimax distinguishable rate. Extensive empirical studies suggest that the proposed test has superior performance over existing methods. Although we only discuss testing the significance of nonparallelism in a two-way SSANOVA model, tests for higher-order SSANOVA models can be developed in parallel with our framework.

Supplementary Material


Acknowledgments

PM was funded in part by NSF DMS-1440037, 1438957, 1925066 and NIH 1R01GM122080-01. WZ was funded in part by NSF DMS-1440038, 1903226, 1925066 and NIH 1R01GM113242-01. Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

Appendix A. Proof of Main Results

In this section, we present the main proofs of the theorems and lemmas in the main text.

A.1. Notation Clarification

We rewrite (16) as

\hat{f}_{11} = \mathcal{K}_{11} M^{-1}\bigl(I_{ns} - S(S^{T}M^{-1}S)^{-1}S^{T}M^{-1}\bigr)\,y,

where

S = \begin{bmatrix} I_n \otimes 1_{w_0} & 0 \\ 0 & I_n \otimes 1_{w_1} \end{bmatrix}\begin{bmatrix} 1_n & 1_n \\ 1_n & 0 \end{bmatrix}, \qquad \mathcal{K}_{11} = \frac{1}{2}\begin{bmatrix} K_{11} & -K_{11} \\ -K_{11} & K_{11} \end{bmatrix},

and

M = \begin{bmatrix} I_n \otimes 1_{w_0} & 0 \\ 0 & I_n \otimes 1_{w_1} \end{bmatrix}\left(\frac{\theta_{10}}{2}\begin{bmatrix} K_{11} & K_{11} \\ K_{11} & K_{11} \end{bmatrix} + \frac{\theta_{11}}{2}\begin{bmatrix} K_{11} & -K_{11} \\ -K_{11} & K_{11} \end{bmatrix} + \lambda I_{2n}\right)\begin{bmatrix} I_n \otimes 1_{w_0}^{T} & 0 \\ 0 & I_n \otimes 1_{w_1}^{T} \end{bmatrix},

K11 is the kernel matrix of H11 with (i, i′)th entry (1/n)K11(xi1, xi′1), w0 is the number of subjects in the control group, w1 is the number of subjects in the case group, and ⊗ denotes the Kronecker product. Based on Chapter A.3 in Gu (2013), we set θ10 ∝ 1/Tr(K10) and θ11 ∝ 1/Tr(K11) with θ10 + θ11 = 1.

In the following theoretical derivation, we focus only on the case s = 2, i.e., w0 = w1 = 1. If there are s > 2 subjects, the proof can be generalized easily by replacing Equation (15) with penalized weighted least squares; see Section 3.2.4 in Gu (2013).
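As a numerical companion to the displays above, the following Python sketch assembles S, the block kernel matrices, and M for w0 = w1 = 1, and evaluates the estimate in (16). The block signs follow our reconstruction of the extraction-garbled matrices and should be verified against the published display before use.

import numpy as np

def build_test_matrices(K, theta10, theta11, lam):
    # K: n x n marginal kernel matrix with entries K11(x_i1, x_i'1) / n.
    n = K.shape[0]
    ones = np.ones((n, 1))
    # Null-space design: an intercept column and a group-indicator column.
    S = np.block([[ones, ones],
                  [ones, np.zeros((n, 1))]])
    # Interaction kernel: positive within groups, negative across groups
    # (our reconstruction of the sign pattern).
    K11_blk = 0.5 * np.block([[K, -K], [-K, K]])
    # Main-effect kernel: positive blocks throughout.
    K10_blk = 0.5 * np.block([[K, K], [K, K]])
    M = theta10 * K10_blk + theta11 * K11_blk + lam * np.eye(2 * n)
    return S, K11_blk, M

def f11_hat(K, theta10, theta11, lam, y):
    # Penalized estimate of the interaction component, following our
    # reconstruction of (16); y stacks the control and case responses.
    S, K11_blk, M = build_test_matrices(K, theta10, theta11, lam)
    Minv = np.linalg.inv(M)
    P = np.eye(2 * K.shape[0]) - S @ np.linalg.solve(S.T @ Minv @ S,
                                                     S.T @ Minv)
    return K11_blk @ Minv @ P @ y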

A.2. Proofs for Section 3

A.2.1. Preliminary

We identify a sequence model that is equivalent to our nonparametric model (1) with the SSANOVA decomposition in Equation (2). Let {μi, ϕi}, i = 1, 2, …, be the eigenvalue and eigenfunction pairs of H1 and {νj, ψj}, j = 1, 2, be the eigenvalue and eigenfunction pairs of H2. In the tensor product space H = H1 ⊗ H2, as shown in Lin (2000), the eigenvalues and eigenfunctions are {μiνj, ϕiψj}, i = 1, …, ∞, j = 1, 2. Model (1) is equivalent to the sequence model

z_{ij} = \theta_{ij} + \omega_{ij}, (30)

where θij = (1/2) Σ_{x〈2〉=0}^{1} ∫_{𝒳1} f(x〈1〉, x〈2〉) ϕi(x〈1〉) ψj(x〈2〉) dω(x〈1〉) are the basis expansion coefficients, and the random noise ωij has mean zero and variance σ²/n. The space E = {f ∈ H : ‖f‖H ≤ 1} in Equation (30) is equivalent to E = {θ : Σ_{i=1}^{∞} Σ_{j=1}^{2} θij²(μiνj)^{−1} ≤ 1}. The hypothesis in Equation (18) is equivalent to the hypothesis

H0 : θi2 = 0 for i = 2, …, n.

Let θ11 = (θ22, θ32, …, θn2)ᵀ and E11 = {θ11 : Σ_{i=2}^{n} θi2²(μiν2)^{−1} ≤ 1}. Consider a local alternative H1n : θ11 ∈ E11 with ‖θ11‖2 ≥ dn, where dn represents a generic distinguishable rate. The total error of a generic testing rule ϕn under distinguishable rate dn can be written as

\mathrm{pseudo.risk}(\phi_n, d_n) = \mathbb{E}_{H_0}\{\phi_n \mid H_0 \text{ is true}\} + \sup_{\theta_{11}\in E_{11},\, \|\theta_{11}\|_2 \ge d_n} \mathbb{E}\{1-\phi_n \mid H_1 \text{ is true}\}. (31)

Equation (31) is consistent with the testing error defined by Ingster (1993) and Wei and Wainwright (2020). For simplicity of exposition, we order the axis lengths {μiν2}, i = 2, 3, …, from the largest to the smallest as {ρp}, p = 1, 2, …. Next we introduce a lemma that gives a lower bound on the minimum pseudo risk.

Lemma 11 For every set C and probability measure Q supported on C ∩ ℬᶜ(dn), we have

\inf_{\phi_n}\ \mathrm{pseudo.risk}(\phi_n, d_n) \ge 1 - \frac{1}{2}\sqrt{\mathbb{E}_{\eta,\eta'}\exp\left(\frac{n\langle\eta,\eta'\rangle}{\sigma^2}\right) - 1},

where E_{η,η′} denotes expectation with respect to an i.i.d. pair η, η′ ∼ Q.

The proof of this lemma follows directly from Lemma 3 in Wei and Wainwright (2020).

A.2.2. Proof of Lemma 4

Proof As shown in Lemma 11, we have

\inf_{\phi_n}\ \mathrm{pseudo.risk}(\phi_n, d_n) \ge 1 - \frac{1}{2}\sqrt{\mathbb{E}_{\eta,\eta'}\exp\left(\frac{n\langle\eta,\eta'\rangle}{\sigma^2}\right) - 1}. (32)

Next we show that if δ² ≤ √(kB(δ)) σ²/(4n), the right-hand side of Equation (32) is larger than 1/2. Let θb = (δ/√k) Σ_{i=1}^{k} bi ei, where ei is the standard basis vector whose ith coordinate is one. We take Q to be the uniform distribution on {θb : b ∈ {−1, 1}ᵏ}. The expectation in the last term of Equation (32) can be written as

\mathbb{E}_{\eta,\eta'}\exp\left(\frac{n\langle\eta,\eta'\rangle}{\sigma^2}\right)
= \frac{1}{2^{2k}}\sum_{b,b'}\exp\left(\frac{n\,\theta_b^{T}\theta_{b'}}{\sigma^2}\right)
= \frac{1}{2^{2k}}\sum_{b,b'}\exp\left(\frac{n\delta^{2}\sum_{i=1}^{k}b_i b_i'}{k\sigma^{2}}\right)
= \frac{1}{2^{k}}\left(\exp\left(\frac{n\delta^{2}}{k\sigma^{2}}\right)+\exp\left(-\frac{n\delta^{2}}{k\sigma^{2}}\right)\right)^{k}
\overset{(i)}{\le} \left(1+\frac{n^{2}\delta^{4}}{k^{2}\sigma^{4}}\right)^{k}
\overset{(ii)}{\le} \exp\left(\frac{n^{2}\delta^{4}}{k\sigma^{4}}\right),

where (i) is due to the fact that (1/2)(eˣ + e⁻ˣ) ≤ 1 + x² for |x| ≤ 1/2 and (ii) is due to the fact that 1 + x ≤ eˣ. Thus for any δ⁴ ≤ kσ⁴/(16n²), we have

\inf_{\phi_n}\ \mathrm{pseudo.risk}(\phi_n, d_n) \ge 1 - \frac{1}{2}\sqrt{e^{1/16} - 1} \ge 1/2.

By the definition of rB, we have pseudo.risk(ϕn, dn) > 1/2 for all dn ≤ rB. ■
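The inequality chain in the proof can be checked numerically by brute-force enumeration of the sign vectors for a small k; the following Python sketch is a verification aid, not part of the proof.

import itertools
import numpy as np

def chi2_term(n, delta, k, sigma2=1.0):
    # Exact value of E exp(n <theta_b, theta_b'> / sigma^2) under the
    # uniform prior on sign vectors b, b' in {-1, +1}^k.
    total, count = 0.0, 0
    for b in itertools.product((-1.0, 1.0), repeat=k):
        for bp in itertools.product((-1.0, 1.0), repeat=k):
            inner = (delta ** 2 / k) * np.dot(b, bp)
            total += np.exp(n * inner / sigma2)
            count += 1
    return total / count

n, k = 100, 8
delta = (k / (16.0 * n ** 2)) ** 0.25  # boundary case delta^4 = k/(16 n^2)
lhs = chi2_term(n, delta, k)
rhs = np.exp(n ** 2 * delta ** 4 / k)  # equals exp(1/16) at the boundary
print(lhs <= rhs)                      # expect True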

A.2.3. Proof of Lemma 5

Proof We show that bk,2(E11) is bounded below by √ρk+1. It suffices to show that E11 contains an ℓ2 ball of radius √ρk+1 centered at f11 = 0 within a (k + 1)-dimensional subspace. For any v in the (k + 1)-dimensional subspace spanned by the eigenvectors corresponding to the (k + 1) largest eigenvalues, with ‖v‖2 ≤ √ρk+1, we have

\sum_{i=1}^{\infty}\frac{v_i^2}{\rho_i} \overset{(i)}{=} \sum_{i=1}^{k+1}\frac{v_i^2}{\rho_i} \overset{(ii)}{\le} \frac{1}{\rho_{k+1}}\sum_{i=1}^{k+1}v_i^2 \le 1,

where equality (i) holds because v lies in the (k + 1)-dimensional subspace spanned by the eigenvectors corresponding to the first (k + 1) largest eigenvalues, and inequality (ii) holds by the decreasing order of the eigenvalues, i.e., ρ1 ≥ ρ2 ≥ … ≥ ρk+1. Hence v ∈ E11.

Recalling the definition of the Bernstein lower critical dimension, kB(δ) = arg maxk{b²k−1,2(E11) ≥ δ²}, we have

kB(δ) ≥ arg maxk{ρk ≥ δ²}.

A.2.4. Proof of Theorem 6

Proof By Lemma 4, we have

dn ≳ sup{δ : kB(δ) ≥ 16n²δ⁴}.

We plug the lower bound on kB(δ) from Lemma 5 into this expression. Then we have

dn ≳ sup{δ : arg maxk{ρk ≥ δ²} ≥ 16n²δ⁴}. (33)

The eigenvalues have a polynomial decay rate, i.e., ρp ≍ p^{−2m}; consequently, arg maxk{ρk ≥ δ²} ≍ δ^{−1/m}. Plugging this into Equation (33), it is easy to see that the supremum on the right-hand side is of order n^{−2m/(4m+1)}. The proof is thus completed. ■
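The rate calculation at the end of the proof admits a quick numerical sanity check: solving δ^{−1/m} = 16n²δ⁴ in closed form and dividing by n^{−2m/(4m+1)} yields a ratio that is constant in n. A small Python sketch:

import numpy as np

def critical_delta(n, m):
    # Balance point of (33): delta^(-1/m) = 16 n^2 delta^4
    # implies delta = (16 n^2)^(-m / (4m + 1)).
    return (16.0 * n ** 2) ** (-m / (4.0 * m + 1.0))

m = 2.0
for n in (10 ** 3, 10 ** 4, 10 ** 5):
    ratio = critical_delta(n, m) / n ** (-2.0 * m / (4.0 * m + 1.0))
    print(n, ratio)  # the ratio is constant in n, confirming the order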

A.3. Proof of Theorem 7

Before proving Theorem 7, we first state Lemma 12, Lemma 13, Lemma 14, and Lemma 15, which are used in the proof of Theorem 7. The proofs of these auxiliary lemmas are deferred to the Supplementary Material.

A.3.1. Some Auxiliary Lemmas

Lemma 12 shows that the projection of f10 on H11 ⊂ Hmodel is zero. This result indicates that our test statistic does not depend on the nuisance parameter f10.

Lemma 12 The quantity 𝒦11M⁻¹(Ins − S(SᵀM⁻¹S)⁻¹SᵀM⁻¹)f10 equals zero.

The next two lemmas show the equivalence of τλ and τ̂λ under the quasi-uniform design and the uniform design.

Lemma 13 If x〈1〉 follows the quasi-uniform random design, then for any λ = 1/n^{1−c}, m > 3/2, and any δ, c > 0, we have

P(\hat{\tau}_\lambda \asymp \tau_\lambda) \ge 1 - \left(n^{\frac{2}{2m-1}-2\delta} + n^{\frac{1}{2m-1}}\right)\exp\left\{-c\,n^{\frac{2m-3}{2m-1}+2\delta}\right\},

where τλ = max{i | μi ≥ λ} and τ̂λ = max{i | μ̂i ≥ λ}.

Lemma 14 If x〈1〉 follows the uniform fixed design condition, then for m > 1/2 and λ > 0, we have

τ̂λ ≍ τλ.

In the following lemma, we bound Tr(Δ) by a function of τ̂λ. This result is essential in deriving the asymptotic distribution of Tn,λ.

Lemma 15 For Δ = M⁻¹𝒦11²M⁻¹ defined in Theorem 7, we have

\frac{4\hat{\tau}_\lambda}{9} \le \mathrm{Tr}(\Delta) \le \frac{4}{(1-\theta_d)^2}\left(\hat{\tau}_\lambda + \frac{1}{2\lambda}\sum_{i=\hat{\tau}_\lambda+1}^{n}\hat{\mu}_i\right). (34)

A.3.2. Proof of Theorem 7

Proof For simplicity, we suppose σ² = 1. We define the three terms on the right-hand side of Equation (26) as T1, T2, and T3, i.e.,

T_1 = \frac{1}{n}\epsilon^{T}\Delta\epsilon, \qquad T_2 = \frac{1}{n}\epsilon^{T}M^{-1}S(S^{T}M^{-1}S)^{-1}S^{T}\Delta S(S^{T}M^{-1}S)^{-1}S^{T}M^{-1}\epsilon, \qquad T_3 = \frac{1}{n}\epsilon^{T}M^{-1}S(S^{T}M^{-1}S)^{-1}S^{T}\Delta\epsilon.

We now show that T2 and T3 are of smaller order than T1. First, we analyze the second term T2 in Equation (26). We have

\mathbb{E}[T_2] = \frac{1}{n}\mathbb{E}\bigl[\epsilon^{T}M^{-1}S(S^{T}M^{-1}S)^{-1}S^{T}\Delta S(S^{T}M^{-1}S)^{-1}S^{T}M^{-1}\epsilon\bigr]
= \frac{1}{n}\mathrm{Tr}\bigl(M^{-1}S(S^{T}M^{-1}S)^{-1}S^{T}\Delta S(S^{T}M^{-1}S)^{-1}S^{T}M^{-1}\bigr)
\le \frac{2}{n}\lambda_{\max}(\Delta)\,\lambda_{\max}\bigl(M^{-1}S(S^{T}M^{-1}S)^{-1}S^{T}S(S^{T}M^{-1}S)^{-1}S^{T}M^{-1}\bigr)
\le \frac{2}{n}\lambda_{\max}(\Delta),

where λmax(·) denotes the largest eigenvalue. Since all eigenvalues of Δ are less than 1, we have E[T2] ≤ 2/n. Analogously, we can derive a variance bound for T2. Combining these results and using the Chebyshev inequality, we have

T_2 = O_p\left(\frac{1}{n}\right). (35)

Second, we analyze the third term T3 in Equation (26). We apply the Cauchy-Schwarz inequality and have

|T_3| \le \sqrt{T_2}\,\sqrt{T_1}. (36)

Finally, we derive the magnitude of T1. We first consider the testing consistency of T1 conditional on X. Denote Eϵ as the expectation with respect to ϵ, and define Varϵ as the variance with respect to ϵ. Note that

E_ϵ[ϵᵀΔϵ] = Tr(Δ),  Var_ϵ[ϵᵀΔϵ] = 2Tr(Δ²).

Let Z = (ϵᵀΔϵ − Tr(Δ))/√(2Tr(Δ²)) and t ∈ (−1/2, 1/2). Then the log-characteristic function of Z can be written as

\log \mathbb{E}_\epsilon[\exp(itZ)] = \log \mathbb{E}_\epsilon\left[\exp\left(\frac{it\,\epsilon^{T}\Delta\epsilon}{\sqrt{2\,\mathrm{Tr}(\Delta^2)}}\right)\right] - \frac{it\,\mathrm{Tr}(\Delta)}{\sqrt{2\,\mathrm{Tr}(\Delta^2)}} = -\frac{1}{2}\log\det\left\{I_{2n} - \frac{2it\,\Delta}{\sqrt{2\,\mathrm{Tr}(\Delta^2)}}\right\} - \frac{it\,\mathrm{Tr}(\Delta)}{\sqrt{2\,\mathrm{Tr}(\Delta^2)}}. (37)

Through a Taylor expansion, one has

-\frac{1}{2}\log\det\left\{I_{2n} - \frac{2it\,\Delta}{\sqrt{2\,\mathrm{Tr}(\Delta^2)}}\right\} = \frac{it\,\mathrm{Tr}(\Delta)}{\sqrt{2\,\mathrm{Tr}(\Delta^2)}} - \frac{t^2\,\mathrm{Tr}(\Delta^2)}{2\,\mathrm{Tr}(\Delta^2)} + O\left(\frac{t^3\,\mathrm{Tr}(\Delta^3)}{[\mathrm{Tr}(\Delta^2)]^{3/2}}\right). (38)

Combining Equations (37) and (38), we have

\log \mathbb{E}_\epsilon[\exp(itZ)] = -\frac{t^2}{2} + O\left(\frac{t^3\,\mathrm{Tr}(\Delta^3)}{[\mathrm{Tr}(\Delta^2)]^{3/2}}\right). (39)

Since all eigenvalues of Δ are less than 1, we have Tr(Δ³) ≤ Tr(Δ²). Analogous to (S.11), we have

\mathrm{Tr}(\Delta^2) \ge \frac{16}{81}\,\hat{\tau}_\lambda. (40)

Under the quasi-uniform design, we have Tr(Δ2) → ∞ as λ → 0 with probability approaching 1 by Lemma 13 and Equation (40). Hence, the second term on the right-hand side of Equation (39) is op(1). We thus conclude that

\mathbb{E}_\epsilon[\exp(itZ)] \overset{P}{\longrightarrow} \exp\left(-\frac{t^2}{2}\right).

Next, we show that

\mathbb{E}[\exp(itZ)] = \mathbb{E}_X\bigl[\mathbb{E}_\epsilon[\exp(itZ)]\bigr] \longrightarrow \exp(-t^2/2)

for t ∈ (−1/2, 1/2). If not, there exist an ε > 0 and a subsequence of random variables X_{nk}〈1〉 such that |E_{X_{nk}〈1〉}E_ϵ exp(itZ) − exp(−t²/2)| > ε. On the other hand, since E_ϵ exp(itZ(X_{nk}〈1〉)) →P exp(−t²/2) and is bounded, there exists a sub-subsequence {X_{nkl}〈1〉} such that E_ϵ exp(itZ(X_{nkl}〈1〉)) →a.s. exp(−t²/2). Then, by the dominated convergence theorem, E_{X_{nkl}〈1〉}E_ϵ exp(itZ) → exp(−t²/2), which is a contradiction. Under the uniform design, we can obtain E[exp(itZ)] → exp(−t²/2) directly from Lemma 14 and Equation (40).

Thus Z is asymptotically normally distributed, and

\frac{T_1 - \mathrm{Tr}(\Delta)/n}{\sqrt{2\,\mathrm{Tr}(\Delta^2)/n^2}} \overset{d}{\longrightarrow} N(0, 1). (41)

Combining (35), (36) and (41), the theorem follows. ■
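The Wilks-type limit in (41) can be visualized by simulation. The Python sketch below draws the normalized quadratic form for an illustrative diagonal Δ with eigenvalues in (0, 1) (our choice for illustration, not the Δ of Theorem 7) and checks that its first two moments are close to those of N(0, 1).

import numpy as np

def normalized_quadratic_form(Delta, n_draws=2000, seed=0):
    # Draws of Z = (e' Delta e - tr Delta) / sqrt(2 tr Delta^2)
    # for standard Gaussian e, the quantity whose limit is used in (41).
    rng = np.random.default_rng(seed)
    d = Delta.shape[0]
    tr1 = np.trace(Delta)
    tr2 = np.trace(Delta @ Delta)
    draws = np.empty(n_draws)
    for i in range(n_draws):
        e = rng.standard_normal(d)
        draws[i] = (e @ Delta @ e - tr1) / np.sqrt(2.0 * tr2)
    return draws

# Illustrative Delta: polynomially decaying eigenvalues in (0, 1), so that
# tr(Delta^2) grows with the dimension, as the proof requires.
d = 400
eigs = 1.0 / (1.0 + (np.arange(1, d + 1) / 50.0) ** 4)
Z = normalized_quadratic_form(np.diag(eigs))
print(Z.mean(), Z.std())  # approximately 0 and 1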

A.4. Proof of Theorem 8

Proof Under the alternative hypothesis, the statistic Tn,λ in Equation (26) can be decomposed into three terms as follows

T_{n,\lambda} = \frac{1}{n}\|H\epsilon\|_2^2 + \frac{1}{n}\|Hf_{11}\|_2^2 + \frac{2}{n}f_{11}^{T}H^{T}H\epsilon, (42)

where H = θ11𝒦11M⁻¹(I − S(SᵀM⁻¹S)⁻¹SᵀM⁻¹). Let W1 = (1/n)‖Hϵ‖2², W2 = (1/n)‖Hf11‖2², and W3 = (2/n)f11ᵀHᵀHϵ denote the corresponding three terms on the right-hand side of Equation (42).

We now derive a lower bound for W2. By Lemma S.1, we have

\frac{1}{n}\|Hf_{11} - f_{11}\|_2^2 \le \frac{1}{n}\|Hf_{11} - f_{11}\|_2^2 + \frac{1}{n}\|Hf_{10} - f_{10}\|_2^2 = \frac{1}{n}\|Hf_{10} + Hf_{11} - f_{10} - f_{11}\|_2^2 = \|\tilde{g}^* - g^*\|_n^2 \le c\lambda. (43)

We consider the distinguishable rate

\frac{1}{n}\|f_{11}\|_2^2 = \|f_{11}\|_n^2 > c^2 d_n^2 = c(\lambda + \sigma_{n,\lambda}), (44)

where the inequality is satisfied since ∥ · ∥n dominates ∥ · ∥2 by Lemma S.2. The lower bound of W2 is thus,

W_2 = \frac{1}{n}\|Hf_{11}\|_2^2 \ge \frac{1}{n}\|f_{11}\|_2^2 - \frac{1}{n}\|f_{11} - Hf_{11}\|_2^2 \ge c^2 d_n^2 - c\lambda = c\,\sigma_{n,\lambda}, (45)

where the last two inequalities are obtained by combining (43) with Equation (44).

For the third term W3, it is seen that EW3 = 0. It is easy to verify that the eigenvalues of HHᵀ are all less than 1. Moreover,

\mathbb{E}W_3^2 = \frac{4}{n^2}\mathbb{E}\bigl[f_{11}^{T}H^{T}H\epsilon\epsilon^{T}H^{T}Hf_{11}\bigr] = \frac{4}{n^2}(Hf_{11})^{T}HH^{T}(Hf_{11}) \le \frac{4}{n^2}(Hf_{11})^{T}(Hf_{11}) = \frac{4}{n}W_2.

By Chebyshev’s inequality, for any ϵ > 0, we have

\mathbb{P}\left(|W_3| \ge \frac{2}{\sqrt{\epsilon}}\sqrt{\frac{W_2}{n}}\right) \le \frac{n\,\mathbb{E}W_3^2}{4\epsilon^{-1}W_2} \le \epsilon.

Consequently, there exists an n0 such that for any n > n0, we have

\mathbb{P}\left\{|W_3| > \frac{1}{2}W_2\right\} \le \mathbb{P}\left(|W_3| \ge \frac{2}{\sqrt{\epsilon}}\sqrt{\frac{W_2}{n}}\right) \le \epsilon. (46)

Now, we are ready to prove our theorem. By the triangle inequality, we have

\left|\frac{W_1 - \mu_{n,\lambda}}{\sigma_{n,\lambda}} + \frac{W_2 + W_3}{\sigma_{n,\lambda}}\right| \ge \left|\frac{W_2 + W_3}{\sigma_{n,\lambda}}\right| - \left|\frac{W_1 - \mu_{n,\lambda}}{\sigma_{n,\lambda}}\right| (47)
\ge \left|\frac{W_2}{\sigma_{n,\lambda}}\right| - \left|\frac{W_3}{\sigma_{n,\lambda}}\right| - \left|\frac{W_1 - \mu_{n,\lambda}}{\sigma_{n,\lambda}}\right|. (48)

If |W1 − μn,λ|/σn,λ ≤ Cϵ and |W3| ≤ (1/2)W2 hold, then in view of (47), (48), and Equation (45), we have

\left|\frac{W_1 - \mu_{n,\lambda}}{\sigma_{n,\lambda}} + \frac{W_2 + W_3}{\sigma_{n,\lambda}}\right| \ge \frac{1}{2}c - C_\epsilon.

Noting that W1 is identical to Equation (26), by Theorem 7 we have |W1 − μn,λ|/σn,λ = Op(1). That is, for any ϵ > 0, there exist a constant Cϵ > 0 and an integer s such that for any n > s, we have

\mathbb{P}\left(\frac{|W_1 - \mu_{n,\lambda}|}{\sigma_{n,\lambda}} > C_\epsilon\right) \le \epsilon. (49)

Setting c ≥ 2(Cϵ + z1−α/2) and N = max(n0, s), for any n > N, we have

\mathbb{P}(\phi_{n,\lambda} = 1) = \mathbb{P}\left\{\frac{|W_1 + W_2 + W_3 - \mu_{n,\lambda}|}{\sigma_{n,\lambda}} \ge z_{1-\alpha/2}\right\}
\ge \mathbb{P}\left\{\frac{|W_1 - \mu_{n,\lambda}|}{\sigma_{n,\lambda}} \le C_\epsilon,\ |W_3| \le \frac{1}{2}W_2\right\}
\ge 1 - \mathbb{P}\left\{\frac{|W_1 - \mu_{n,\lambda}|}{\sigma_{n,\lambda}} > C_\epsilon\right\} - \mathbb{P}\left\{|W_3| > \frac{1}{2}W_2\right\} \ge 1 - 2\epsilon,

where the second inequality is due to Boole’s inequality (Casella and Berger, 2002) and the last inequality is obtained by combining Equation (46) and Equation (49). Thus, we have

\sup_{H_1^*} \mathbb{E}(1 - \phi_{n,\lambda} \mid H_1^* \text{ is true}) < \delta,

where H1* = {f : f ∈ Hmodel and ‖f11‖2 ≥ Cδ√(λ + σn,λ) ≥ dn}. ■

Footnotes

1.

The mth order Sobolev space is defined as H1 = {η1 ∈ L2[0, 1] : η1^(k) is absolutely continuous for k = 0, 1, …, m − 1}.

2.

Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf

References

1. Abramowitz Milton and Stegun Irene A. Handbook of mathematical functions: with formulas, graphs, and mathematical tables. National Bureau of Standards, Washington, DC, 1964.
2. Alaoui Ahmed and Mahoney Michael W. Fast randomized kernel ridge regression with statistical guarantees. In Advances in Neural Information Processing Systems 28, pages 775–783, 2015.
3. Bartlett Peter L, Bousquet Olivier, and Mendelson Shahar. Local Rademacher complexities. Annals of Statistics, 33(4):1497–1537, 2005.
4. Benjamini Yoav and Hochberg Yosef. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), 57(1):289–300, 1995.
5. Bilban Martin, Heintel Daniel, Scharl Theresa, Woelfel Thomas, Auer Michael M, Porpaczy Edit, Kainz Birgit, Krober Alexander, Carey Vincent J, Shehata Medhat, Zielinski C, Pickl W, Stilgenbauer S, Gaiger A, Wagner O, Jager U, and German CLL Study Group. Deregulated expression of fat and muscle genes in B-cell chronic lymphocytic leukemia with high lipoprotein lipase expression. Leukemia, 20(6):1080–1088, 2006.
6. Braun Mikio L. Accurate error bounds for the eigenvalues of the kernel matrix. Journal of Machine Learning Research, 7(Nov):2303–2328, 2006.
7. Casella George and Berger Roger L. Statistical inference. Duxbury, Pacific Grove, CA, 2nd edition, 2002.
8. Degras David, Xu Zhiwei, Zhang Ting, and Wu Wei Biao. Testing for parallelism among trends in multiple time series. IEEE Transactions on Signal Processing, 60(3):1087–1097, 2011.
9. Drineas Petros and Mahoney Michael W. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6(Dec):2153–2175, 2005.
10. Echávarri C, Aalten P, Uylings HBM, Jacobs HIL, Visser PJ, Gronenschild EHBM, Verhey FRJ, and Burgmans S. Atrophy in the parahippocampal gyrus as an early biomarker of Alzheimer's disease. Brain Structure and Function, 215(3–4):265–271, 2011.
11. Eggermont Paulus Petrus Bernardus and LaRiccia Vincent N. Maximum penalized likelihood estimation, volume II. Springer, 2001.
12. Fan Jianqing and Zhang Jian. Sieve empirical likelihood ratio tests for nonparametric functions. Annals of Statistics, 32(5):1858–1907, 2004.
13. Fan Jianqing, Zhang Chunming, and Zhang Jian. Generalized likelihood ratio statistics and Wilks phenomenon. Annals of Statistics, 29(1):153–193, 2001.
14. Filarsky Katharina, Garding Angela, Becker Natalia, Wolf Christine, Zucknick Manuela, Claus Rainer, Weichenhan Dieter, Plass Christoph, Döhner Hartmut, Stilgenbauer Stephan, Lichter Peter, and Mertens Daniel. Krüppel-like factor 4 (KLF4) inactivation in chronic lymphocytic leukemia correlates with promoter DNA-methylation and can be reversed by inhibition of Notch signaling. Haematologica, 101(6):249, 2016.
15. Giné Evarist and Nickl Richard. Mathematical foundations of infinite-dimensional statistical models. Cambridge University Press, 2015.
16. Golub Gene H, Heath Michael, and Wahba Grace. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215–223, 1979.
17. Gu Chong. Model diagnostics for smoothing spline ANOVA models. Canadian Journal of Statistics, 32(4):347–358, 2004.
18. Gu Chong. Smoothing spline ANOVA models. Springer, 2nd edition, 2013.
19. Hansen Kasper D, Langmead Benjamin, and Irizarry Rafael A. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biology, 13(10):R83, 2012.
20. Huettel Scott A, Song Allen W, and McCarthy Gregory. Functional magnetic resonance imaging. Sinauer Associates, Sunderland, 2004.
21. Ingster Yuri and Suslina Irina A. Nonparametric goodness-of-fit testing under Gaussian models. Springer Science & Business Media, 2012.
22. Ingster Yuri I. Asymptotically minimax hypothesis testing for nonparametric alternatives. I, II, III. Mathematical Methods of Statistics, 2(2):85–114, 1993.
23. Irizarry Rafael A, Ladd-Acosta Christine, Carvalho Benilton, Wu Hao, Brandenburg Sheri A, Jeddeloh Jeffrey A, Wen Bo, and Feinberg Andrew P. Comprehensive high-throughput arrays for relative methylation (CHARM). Genome Research, 18(5):780–790, 2008.
24. Kesslak J Patrick, Nalcioglu Orhan, and Cotman Carl W. Quantification of magnetic resonance scans for hippocampal and parahippocampal atrophy in Alzheimer's disease. Neurology, 41(1):51–51, 1991.
25. Kim Young-Ju and Gu Chong. Smoothing spline Gaussian regression: more scalable computation via efficient approximation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(2):337–356, 2004.
26. Lin Yi. Tensor product space ANOVA models. Annals of Statistics, 28(3):734–755, 2000.
27. Liu Anna and Wang Yuedong. Hypothesis testing in smoothing spline models. Journal of Statistical Computation and Simulation, 74(8):581–597, 2004.
28. Liu Meimei and Cheng Guang. Early stopping for nonparametric testing. In Advances in Neural Information Processing Systems, pages 3985–3994, 2018.
29. Liu Meimei, Shang Zuofeng, and Cheng Guang. Sharp theoretical analysis for nonparametric testing under random projection. In Conference on Learning Theory, pages 2175–2209, 2019.
30. Liu Qiang, Lee Jason, and Jordan Michael. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, pages 276–284, 2016.
31. Ma Ping, Zhong Wenxuan, and Liu Jun S. Identifying differentially expressed genes in time course microarray data. Statistics in Biosciences, 1(2):144, 2009.
32. Ma Ping, Mahoney Michael W, and Yu Bin. A statistical perspective on algorithmic leveraging. Journal of Machine Learning Research, 16(1):861–911, 2015.
33. Ma Siyuan and Belkin Mikhail. Diving into the shallows: a computational perspective on large-scale shallow learning. In Advances in Neural Information Processing Systems, pages 3778–3787, 2017.
34. Munk Axel and Dette Holger. Nonparametric comparison of several regression functions: exact and asymptotic theory. Annals of Statistics, 26(6):2339–2368, 1998.
35. Nichols Thomas E and Holmes Andrew P. Nonparametric permutation tests for functional neuroimaging: a primer with examples. Human Brain Mapping, 15(1):1–25, 2002.
36. Okano Masaki, Xie Shaoping, and Li En. Cloning and characterization of a family of novel mammalian DNA (cytosine-5) methyltransferases. Nature Genetics, 19(3):219, 1998.
37. Orrison William W, Lewine Jeffrey, Sanders John, and Hartshorne Michael F. Functional brain imaging. Elsevier Health Sciences, 2017.
38. Pallasch CP, Schwamb J, Königs S, Schulz A, Debey S, Kofler D, Schultze JL, Hallek M, Ultsch A, and Wendtner CM. Targeting lipid metabolism by the lipoprotein lipase inhibitor orlistat results in apoptosis of B-cell chronic lymphocytic leukemia cells. Leukemia, 22(3):585–592, 2008.
39. Pinkus Allan. N-widths in approximation theory, volume 7. Springer Science & Business Media, 2012.
40. Rami Lorena, Sala-Llonch Roser, Solé-Padullés Cristina, Fortea Juan, Olives Jaume, Lladó Albert, Peña-Gómez Cleofe, Balasa Mircea, Bosch Bea, Antonell Anna, Sanchez-Valle R, Bartrés-Faz D, and Molinuevo JL. Distinct functional activity of the precuneus and posterior cingulate cortex during encoding in the preclinical stage of Alzheimer's disease. Journal of Alzheimer's Disease, 31(3):517–526, 2012.
41. Raskutti Garvesh, Wainwright Martin J, and Yu Bin. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. Journal of Machine Learning Research, 15(1):335–366, 2014.
42. Rombouts Serge ARB, Barkhof Frederik, Goekoop Rutger, Stam Cornelis J, and Scheltens Philip. Altered resting state networks in mild cognitive impairment and mild Alzheimer's disease: an fMRI study. Human Brain Mapping, 26(4):231–239, 2005.
43. Scheff Stephen W, Price Douglas A, Ansari Mubeen A, Roberts Kelly N, Schmitt Frederick A, Ikonomovic Milos D, and Mufson Elliott J. Synaptic change in the posterior cingulate gyrus in the progression of Alzheimer's disease. Journal of Alzheimer's Disease, 43(3):1073–1090, 2015.
44. Schölkopf Bernhard, Herbrich Ralf, and Smola Alex J. A generalized representer theorem. In International Conference on Computational Learning Theory, pages 416–426. Springer, 2001.
45. Schübeler Dirk. Function and information content of DNA methylation. Nature, 517(7534):321, 2015.
46. Shang Zuofeng and Cheng Guang. Local and global asymptotic inference in smoothing spline models. Annals of Statistics, 41(5):2608–2638, 2013.
47. Shang Zuofeng and Cheng Guang. Computational limits of a distributed algorithm for smoothing spline. Journal of Machine Learning Research, 18(1):3809–3845, 2017.
48. Shen Xiaotong, Huang Hsin-Cheng, and Cressie Noel. Nonparametric hypothesis testing for a spatial signal. Journal of the American Statistical Association, 97(460):1122–1140, 2002.
49. Smith Stephen M, Jenkinson Mark, Woolrich Mark W, Beckmann Christian F, Behrens Timothy EJ, Johansen-Berg Heidi, Bannister Peter R, De Luca Marilena, Drobnjak Ivana, Flitney David E, Niazy RK, Saunders J, Vickers J, Zhang Y, De Stefano N, Brady JM, and Matthews PM. Advances in functional and structural MR image analysis and implementation as FSL. Neuroimage, 23:S208–S219, 2004.
50. Stach Dirk, Schmitz Oliver J, Stilgenbauer Stephan, Benner Axel, Döhner Hartmut, Wiessler Manfred, and Lyko Frank. Capillary electrophoretic analysis of genomic DNA methylation levels. Nucleic Acids Research, 31(2):e2, 2003.
51. Ståhle Lars and Wold Svante. Analysis of variance (ANOVA). Chemometrics and Intelligent Laboratory Systems, 6(4):259–272, 1989.
52. Storey John D, Xiao Wenzhong, Leek Jeffrey T, Tompkins Ronald G, and Davis Ronald W. Significance analysis of time course microarray experiments. Proceedings of the National Academy of Sciences, 102(36):12837–12842, 2005.
53. Subramanian Aravind, Tamayo Pablo, Mootha Vamsi K, Mukherjee Sayan, Ebert Benjamin L, Gillette Michael A, Paulovich Amanda, Pomeroy Scott L, Golub Todd R, Lander Eric S, and Mesirov Jill P. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43):15545–15550, 2005.
54. Vossoughi Mehrdad, Ayatollahi SMT, Towhidi Mina, and Heydari Seyyed Taghi. A distribution-free test of parallelism for two-sample repeated measurements. Statistical Methodology, 30:31–44, 2016.
55. Wahba Grace. Spline models for observational data. SIAM, 1990.
56. Wahba Grace, Wang Yuedong, Gu Chong, Klein Ronald, and Klein Barbara. Smoothing spline ANOVA for exponential families, with application to the Wisconsin Epidemiological Study of Diabetic Retinopathy: the 1994 Neyman memorial lecture. Annals of Statistics, 23(6):1865–1895, 1995.
57. Wang Liang, Zang Yufeng, He Yong, Liang Meng, Zhang Xinqing, Tian Lixia, Wu Tao, Jiang Tianzi, and Li Kuncheng. Changes in hippocampal connectivity in the early stages of Alzheimer's disease: evidence from resting state fMRI. Neuroimage, 31(2):496–504, 2006.
58. Wang Yazhen. Change curve estimation via wavelets. Journal of the American Statistical Association, 93(441):163–172, 1998.
59. Wang Yuedong. Smoothing splines: methods and applications. CRC Press, 2011.
60. Wei Yuting and Wainwright Martin J. The local geometry of testing in ellipses: tight control via localized Kolmogorov widths. IEEE Transactions on Information Theory, 2020.
61. Wood Simon N. Thin plate regression splines. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(1):95–114, 2003.
62. Yang Yun, Pilanci Mert, and Wainwright Martin J. Randomized sketches for kernels: fast and optimal nonparametric regression. Annals of Statistics, 45(3):991–1023, 2017.
