Published in final edited form as: J Mach Learn Res. 2020;21:94.

Minimax Nonparametric Parallelism Test

Xin Xing 1, Meimei Liu 1, Ping Ma 2, Wenxuan Zhong 2

Abstract

Testing the hypothesis of parallelism is a fundamental statistical problem arising in many applied sciences. In this paper, we develop a nonparametric parallelism test for inferring whether the trends are parallel in the treatment and control groups. In particular, the proposed nonparametric parallelism test is a Wald-type test based on a smoothing spline ANOVA (SSANOVA) model, which can characterize complex patterns in the data. We derive the asymptotic null distribution of the test statistic, unveiling a new version of the Wilks phenomenon: after standardization, the statistic is asymptotically standard normal, free of nuisance parameters. Notably, we establish the minimax sharp lower bound of the distinguishable rate for the nonparametric parallelism test using information theory, and further prove that the proposed test is minimax optimal. Simulation studies are conducted to investigate the empirical performance of the proposed test. DNA methylation and neuroimaging studies are presented to illustrate potential applications of the test. The software is available at https://github.com/BioAlgs/Parallelism.

Keywords: asymptotic distribution, minimax optimality, nonparametric inference, parallelism test, penalized least squares, smoothing spline ANOVA, Wald test

1. Introduction

The assessment of parallelism is a fundamental problem in statistical inference and arises in many applications. For example, in genomic studies, a question of primary interest is to detect genes with nonparallel expression patterns in time course studies (Storey et al., 2005; Ma et al., 2009). Another motivating example comes from epigenomics, where researchers are interested in testing whether the patterns of DNA methylation intensities along the genome in the treatment and control groups are parallel (Hansen et al., 2012). Abnormal DNA methylation patterns are associated with changes in many important biological processes such as imprinting, X-chromosome inactivation, and aging (Schübeler, 2015). In functional neuroimaging, a common problem is to detect nonparallel signals (Nichols and Holmes, 2002; Orrison et al., 2017) among different brain regions.

There is an immense literature on analyzing the parallelism of trends using linear model-based approaches, ranging from simple ANOVA (Ståhle and Wold, 1989) to linear mixed models (Vossoughi et al., 2016). However, linear model-based approaches have a limited ability to parsimoniously represent non-linear structures in complex data. Nonparametric parallelism comparison methods have drawn considerable attention due to their modeling flexibility. Munk and Dette (1998) developed a test statistic through a weighted $L_2$ distance between the regression functions under similar equally spaced fixed designs. Degras et al. (2011) tested the parallelism of multiple time series based on the $L_2$ distances between the local linear estimator of each individual curve and the global one when the time points are evenly spaced. Wang (1998) proposed a wavelet-based method to measure the changes of curves. Liu and Wang (2004) compared different nonparametric testing methods and showed that the performance of these tests depends on the shape of the true function. Ma et al. (2009) proposed an approximate F-test to detect nonparallel patterns in time course gene expression data under a more flexible random design.

However, rigorous testing methods with optimal power guarantees are still lacking in the existing nonparametric parallelism literature. The key cause of this research gap is that, in contrast to a simple, linear, or polynomial null hypothesis, the parameter space of the null hypothesis for nonparametric parallelism testing is a nonparametric function class of infinite dimension. How to conduct a rigorous test for such a composite functional null hypothesis remains an open question. A major motivation of this article is to develop a nonparametric parallelism testing approach that detects the significance of the nonparallel effect while guaranteeing statistical optimality in the sense of the minimax testing rate, facilitating power analysis.

In this article, we develop a nonparametric parallelism test based on the decomposition of a tensor product reproducing kernel Hilbert space (RKHS) (Wahba, 1990; Gu, 2013; Wang, 2011) under both fixed and random designs. The tensor product RKHS provides a flexible space for modeling complex functions; see Wahba et al. (1995), Wood (2003), and references therein. For simplicity of description, we consider the case with two predictors only. Suppose the response variable $Y_{ij}$ is the observed value of the $j$th subject at the $i$th time or spatial location for $i = 1, \ldots, n$ and $j = 1, \ldots, s$. $Y_{ij}$ depends on two predictors $x_i^{\langle1\rangle}$ and $x_j^{\langle2\rangle}$ through an unknown bivariate function $f(\cdot,\cdot) \in \mathcal{H}$, the tensor product RKHS, where $x_i^{\langle1\rangle} \in \mathcal{X}_1 = [0,1]$ is a continuous variable representing the $i$th time or spatial location, and $x_j^{\langle2\rangle} \in \mathcal{X}_2 = \{0,1\}$ is a discrete variable indicating the group of the $j$th subject: $x_j^{\langle2\rangle} = 1$ if the $j$th subject is in the treatment group and $x_j^{\langle2\rangle} = 0$ otherwise. That is,

$Y_{ij} = f(x_i^{\langle1\rangle}, x_j^{\langle2\rangle}) + \epsilon_{ij}, \quad i = 1, \ldots, n, \; j = 1, \ldots, s,$ (1)

where the $\epsilon_{ij}$'s are i.i.d. random noise following a normal distribution with mean zero and variance $\sigma^2$, and $s$ is the number of subjects. Each subject can be represented by a curve. When $s = 2$, there are two curves in total and each group has only one curve. When $s > 2$, we have multiple curves in each group. We assume i.i.d. random noise since, in many scientific experiments, the random errors are attributed to environmental factors independent of the time points or spatial locations. For example, in the fMRI data analysis in Section 6, the error is mostly attributed to random head movement and imaging noise, which are independent of time.
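To fix ideas, the following minimal Python sketch (not from the paper; the trend functions and parameters are illustrative assumptions) generates data from model (1) with one curve per group.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, sigma = 200, 2, 1.0                 # n locations, s subjects (one per group)

def f(x1, x2):
    # hypothetical smooth trends; the two curves differ by a constant shift,
    # so the nonparallel effect f11 is zero here
    return 2.5 * np.sin(3 * np.pi * x1) * (1 - x1) + 0.5 * (x2 == 1)

x1 = rng.uniform(0, 1, n)                 # random design on [0, 1]
x2 = np.array([0, 1])                     # group labels: control = 0, treatment = 1
Y = f(x1[:, None], x2[None, :]) + sigma * rng.normal(size=(n, s))   # model (1)
```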

Analogous to the classical ANOVA decomposition, $f \in \mathcal{H}$ admits the smoothing spline ANOVA (SSANOVA) decomposition (Wahba, 1990):

$f(x_i^{\langle1\rangle}, x_j^{\langle2\rangle}) = f_{00} + f_{10}(x_i^{\langle1\rangle}) + f_{01}(x_j^{\langle2\rangle}) + f_{11}(x_i^{\langle1\rangle}, x_j^{\langle2\rangle}),$ (2)

where $f_{00}$ is the grand mean, $f_{10}$ and $f_{01}$ are the main effects, and $f_{11}$ is the nonparallel effect. When $f_{11} = 0$ (see the left panel in Figure 1), the curves in the two groups are parallel: $f_{11} = 0$ is equivalent to $f(x^{\langle1\rangle}, 0)$ and $f(x^{\langle1\rangle}, 1)$ being parallel. When $f_{11} \neq 0$ (see the right panel in Figure 1), the magnitude of $\|f_{11}\|_2^2$ characterizes the significance of the non-parallelism between the treatment and control groups, where $\|f_{11}\|_2^2 = \sum_{x^{\langle2\rangle}=0}^{1}\int_0^1 f_{11}^2(x^{\langle1\rangle}, x^{\langle2\rangle})\,d\omega_1$, with $\omega_1$ the marginal density of $x^{\langle1\rangle}$. Statistically, the hypothesis testing for parallelism can be formulated as

$H_0: f_{11} = 0 \quad \text{vs.} \quad H_1: f_{11} \neq 0.$ (3)
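To see concretely why $f_{11}$ encodes parallelism, consider the difference of the two curves implied by the decomposition in Equation (2); the following short derivation is a worked special case added for illustration.

```latex
f(x^{\langle1\rangle}, 1) - f(x^{\langle1\rangle}, 0)
  = \underbrace{f_{01}(1) - f_{01}(0)}_{\text{constant in } x^{\langle1\rangle}}
  + \underbrace{f_{11}(x^{\langle1\rangle}, 1) - f_{11}(x^{\langle1\rangle}, 0)}_{\text{varies with } x^{\langle1\rangle}} .
```

The two curves are parallel exactly when their difference is constant in $x^{\langle1\rangle}$, which holds if and only if the $f_{11}$ contribution vanishes; for instance, a pure vertical shift $f(x^{\langle1\rangle}, 1) = f(x^{\langle1\rangle}, 0) + c$ gives $f_{11} = 0$.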

We introduce two concrete examples which motivate our study.

Figure 1: An illustration of two scenarios of a bivariate function $f(x^{\langle1\rangle}, x^{\langle2\rangle})$, where $x^{\langle1\rangle}$ is continuous and $x^{\langle2\rangle}$ only takes two values, 0 and 1. Left panel: the scenario with $f_{11} = 0$, i.e., $f(x^{\langle1\rangle}, 0)$ and $f(x^{\langle1\rangle}, 1)$ are parallel. Right panel: the scenario with $f_{11} \neq 0$, i.e., $f(x^{\langle1\rangle}, 0)$ and $f(x^{\langle1\rangle}, 1)$ are nonparallel.

Example 1. DNA methylation in a case-control study. DNA methylation is an essential epigenetic mechanism that regulates gene expression. Aberrant DNA methylation contributes to a number of human diseases including cancer (Stach et al., 2003). In a typical case-control study of DNA methylation (Filarsky et al., 2016), the DNA methylation level, denoted $Y_{ij}$, at the $i$th genome location $x_i^{\langle1\rangle}$ for the $j$th individual in group $x_j^{\langle2\rangle}$ can be modeled using Equation (1), where $f$ is an unknown function with the SSANOVA decomposition in Equation (2). A primary focus is to infer whether the DNA methylation levels have different profiles along the genome between the case and control groups, i.e., to test the presence or absence of the nonparallel effect $f_{11}$ as in Equation (3).

Example 2. Neuroimaging using functional magnetic resonance imaging (fMRI). fMRI is a powerful neuroimaging technology for the diagnosis of many brain-related diseases. It measures brain activity by detecting changes associated with blood flow. The primary form of fMRI uses the blood-oxygen-level dependent (BOLD) contrast as the signal (Huettel et al., 2004). In many case-control studies, the BOLD signal $Y_{ij}$ at the $i$th time $x_i^{\langle1\rangle}$ for the $j$th subject in group $x_j^{\langle2\rangle}$ is measured for a particular region of interest (ROI) and can be modeled using Equation (1), where $f$ is an unknown function with the SSANOVA decomposition in Equation (2). The goal is to test whether the BOLD signals in the two groups have the same pattern over time, i.e., to test the significance of the nonparallel effect $f_{11}$ in Equation (3).

We first establish the minimax lower bound for the nonparametric parallelism test in Equation (3) over general testing rules, with the aid of the tensor product decomposition of the RKHS and information theory. The tensor product decomposition in Equation (2) enables us to quantify the magnitude of nonparallelism by $\|f_{11}\|_2$, where $\|\cdot\|_2$ is the $L_2$ norm. Intuitively, the smaller $\|f_{11}\|_2$ is, the harder it is to distinguish the alternative hypothesis from the null. In analyzing the power performance, we consider a slightly different alternative hypothesis,

$H_1^*: \|f_{11}\|_2 \geq d_n,$ (4)

where we remove the neighborhood within distance $d_n$ of $f_{11} = 0$ from the original alternative $H_1$. The sequence $d_n$ is called the distinguishable rate (or separation rate) (Ingster and Suslina, 2012; Giné and Nickl, 2015). We first introduce a geometric interpretation of the testing problem in Equation (3), and then establish a general minimax lower bound for the distinguishable rate of the nonparametric parallelism test using the Bernstein k-width from information theory (Pinkus, 2012). The Bernstein k-width provides a geometric measure of the distinguishable rate and is easy to evaluate in the tensor product RKHS. Recently, a similar technique was used to analyze testing problems over cones in Gaussian sequence models (Wei and Wainwright, 2020).

In addition, we propose a Wald-type test statistic as the squared empirical norm of the penalized least squares estimator of $f_{11}$. We derive its asymptotic null distribution, which exhibits the Wilks phenomenon. The asymptotic distribution of our standardized test statistic is Gaussian, and the testing rule does not depend on any unknown quantities, so it is easy to compute. We can further reduce the computational cost by applying popular fast computation methods such as fast random kernel methods (Alaoui and Mahoney, 2015) and subsampling methods (Ma et al., 2015; Kim and Gu, 2004). Our proposed Wald-type test differs from existing nonparametric testing methods as follows. Existing testing procedures mostly consider a simple null hypothesis, such as the generalized likelihood ratio test in Fan et al. (2001), the penalized likelihood ratio test in Shang and Cheng (2013), the wavelet-based method in Shen et al. (2002), and the kernelized Stein method in Liu et al. (2016), whereas we consider a composite null hypothesis. More importantly, there is a nontrivial technical complication in addition to this difference in model setting. The composite null hypothesis $H_0: f_{11} = 0$ here defines a nonparametric function class in an infinite-dimensional functional space rather than a parametric family in a finite-dimensional parameter space as required in Shang and Cheng (2013), because testing $H_0: f_{11} = 0$ is equivalent to testing $H_0: f \in \{f_{00} + f_{10} + f_{01}\}$. Deriving the limiting distribution of the test statistic over an infinite-dimensional null hypothesis space and quantifying the testing difficulty are very challenging, since the distribution relies on the more delicate tensor product decomposition of the RKHS.

We further prove that the upper bound of the distinguishable rate for the proposed Wald-type test matches the established minimax lower bound. Thus the proposed Wald-type test is minimax optimal. To the best of our knowledge, our work is the first to establish a minimax nonparametric parallelism test. Based on the Wald-type test statistic, we propose a data-adaptive choice of the regularization parameter with a testing optimality guarantee.

The rest of the paper is organized as follows. We introduce the background of tensor product RKHS in Section 2. In Section 3, we introduce the minimax principle and a geometric interpretation of the parallelism testing problem, and derive the minimax lower bound of the distinguishable rate for general parallelism tests using information theory. In Section 4, we propose the Wald-type test, derive its asymptotic null distribution, and prove its minimax optimality. Section 5 presents various simulation studies demonstrating the performance of our testing method, and Section 6 applies the method to genome-wide anomalies of DNA methylation in chronic lymphocytic leukemia patients and brain function changes in patients with Alzheimer's disease. We conclude with a few remarks in Section 7. All technical proofs are relegated to the Appendix and Supplementary Material.

2. Background

In this section, we introduce some background on the tensor product RKHS and its tensor product decomposition, together with penalized least squares estimation.

2.1. Reproducing Kernel Hilbert Space

Given an RKHS $\mathcal{H}$ with inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$, there exists a symmetric and square integrable function $K(\cdot,\cdot): \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that

$\langle f, K(x,\cdot)\rangle_{\mathcal{H}} = f(x), \quad \text{for all } f \in \mathcal{H} \text{ and } x \in \mathcal{X}.$

We call $K$ the reproducing kernel of $\mathcal{H}$. By Mercer's theorem, any continuous kernel admits the decomposition

$K(x, y) = \sum_{\nu=0}^{\infty}\lambda_\nu\varphi_\nu(x)\varphi_\nu(y),$ (5)

where the $\lambda_\nu$'s are non-negative eigenvalues in descending order and the $\varphi_\nu$'s are the corresponding eigenfunctions.
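The Mercer expansion in Equation (5) can be mimicked numerically by eigendecomposing a kernel Gram matrix on the design points; the sketch below (with a hypothetical Gaussian kernel, purely for illustration) recovers the kernel from its leading eigenpairs.

```python
import numpy as np

n = 300
x = np.linspace(0, 1, n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)   # Gram matrix K(x_i, x_j)

# eigh returns ascending eigenvalues; flip to the descending order of Equation (5)
lam, phi = np.linalg.eigh(K / n)
lam, phi = lam[::-1], phi[:, ::-1]

# truncated Mercer-type reconstruction from the 20 leading eigenpairs
K_hat = n * (phi[:, :20] * lam[:20]) @ phi[:, :20].T
print(np.abs(K - K_hat).max())                       # small truncation error
```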

We consider the bivariate function $f$ in Equation (1) on the product domain $\mathcal{X}_1 \times \mathcal{X}_2$. We assume that $f$ is a function in a tensor product RKHS (Lin, 2000)

$\mathcal{H} = \mathcal{H}_1 \otimes \mathcal{H}_2.$ (6)

Given Hilbert spaces $\mathcal{H}_1$ and $\mathcal{H}_2$, $\mathcal{H}_1 \otimes \mathcal{H}_2$ is defined as the completion of the class of functions of the form $\sum_{i=1}^{M}\eta_{1i}(x)\eta_{2i}(y)$, where $\eta_{1i} \in \mathcal{H}_1$, $\eta_{2i} \in \mathcal{H}_2$, and $M$ is any positive integer. We consider $\mathcal{H}_1$ to be an $m$th order homogeneous Sobolev space, i.e.,

$\mathcal{H}_1 = \left\{\eta_1 \in L^2[0,1] : \eta_1^{(k)} \text{ is absolutely continuous and } \eta_1^{(k)}(0) = \eta_1^{(k)}(1) \text{ for } k = 0, 1, \ldots, m-1; \; \eta_1^{(m)} \in L^2[0,1]\right\},$

and $\mathcal{H}_2$ is a two-dimensional Euclidean space with the standard Euclidean norm.

Assume that $\mathcal{H}_1$ has the eigenvalue-eigenfunction pairs $\{\mu_i, \phi_i\}_{i=0}^{\infty}$ and $\mathcal{H}_2$ has the eigenvalue-eigenvector pairs $\{\nu_j, \psi_j\}_{j=1}^{2}$. Then the eigenvalue-eigenfunction pairs for the kernel function $K$ in $\mathcal{H}$ are

$\{\mu_i\nu_j, \phi_i\psi_j\} \quad \text{for } i = 0, 1, \ldots, \; j = 1, 2,$ (7)

in the decomposition in Equation (5). We refer to Equation (7) as the eigensystem of $\mathcal{H}$. We further denote by $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ the product inner product induced by the inner products on the marginal spaces $\mathcal{H}_1$ and $\mathcal{H}_2$ (Lin, 2000).

Using the Riesz representation theorem (Schölkopf et al., 2001), we can easily represent any function $f \in \mathcal{H}$ as in the following lemma.

Lemma 1 Given the sampling points $x_{ij} = (x_i^{\langle1\rangle}, x_j^{\langle2\rangle})$, $i = 1, \ldots, n$ and $j = 1, \ldots, s$, for any $f$ in a reproducing kernel Hilbert space $\mathcal{H}$, there exists a set of reproducing kernels $K_{x_{ij}}(\cdot,\cdot)$ such that

$f(x^{\langle1\rangle}, x^{\langle2\rangle}) = \sum_{i=1}^{n}\sum_{j=1}^{s}\alpha_{ij}K_{x_{ij}}(x^{\langle1\rangle}, x^{\langle2\rangle}) + \rho(x^{\langle1\rangle}, x^{\langle2\rangle}).$ (8)

Lemma 1 implies that $f$ can be expressed as the sum of a linear expansion of the $K_{x_{ij}}$'s and a nonlinear function $\rho$. Notice that when $(x^{\langle1\rangle}, x^{\langle2\rangle}) \in \{x_{ij}\}_{i=1,\ldots,n}^{j=1,\ldots,s}$, we have $\rho(x^{\langle1\rangle}, x^{\langle2\rangle}) = 0$. Thus, $\rho(\cdot,\cdot)$ can be considered as a residual that quantifies the unknown information of the function $f$. To estimate $f$, we only need to specify $K_{x_{ij}}(\cdot,\cdot)$ and estimate the $\alpha_{ij}$'s. Next, we provide a way to construct the reproducing kernels $K_{x_{ij}}(\cdot,\cdot)$. To do so, we need the following two lemmas.

Lemma 2 Suppose $K_1$ is the reproducing kernel of $\mathcal{H}_1$ on $\mathcal{X}_1$, and $K_2$ is the reproducing kernel of $\mathcal{H}_2$ on $\mathcal{X}_2$. Then the reproducing kernel of $\mathcal{H}_1 \otimes \mathcal{H}_2$ on $\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2$ is $K(x, z) = K_1(x^{\langle1\rangle}, z^{\langle1\rangle})K_2(x^{\langle2\rangle}, z^{\langle2\rangle})$ with $x = (x^{\langle1\rangle}, x^{\langle2\rangle})$ and $z = (z^{\langle1\rangle}, z^{\langle2\rangle})$.

Lemma 3 For every Sobolev space $\mathcal{H}$ of functions on $\mathcal{X}$, there corresponds a unique reproducing kernel $K$, which is non-negative definite. If $K_0$ and $K_1$ are the non-negative definite reproducing kernels of $\mathcal{H}_0$ and $\mathcal{H}_1$, respectively, and $\mathcal{H}_0 \cap \mathcal{H}_1 = \{0\}$, then $\mathcal{H}_0 \oplus \mathcal{H}_1$ has the reproducing kernel $K = K_0 + K_1$.

Lemmas 2 and 3 can be easily proved based on Theorems 2.3 to 2.6 in Gu (2013). Lemma 2 states that the reproducing kernel of a tensor product space is the product of the reproducing kernels. Lemma 3 states that the reproducing kernel of a tensor sum space is the sum of the reproducing kernels. Therefore, to construct $K_{x_{ij}}(\cdot,\cdot)$, we introduce the decomposition of the tensor product space in the following subsection.

2.2. Decomposition of Tensor Product Space

For any $\eta_1 \in \mathcal{H}_1$ and $\eta_2 \in \mathcal{H}_2$, define the averaging operators $A_1: \eta_1 \mapsto \int_0^1\eta_1(x)dx$ and $A_2: \eta_2 \mapsto \frac{1}{2}\sum_{k=1}^{2}\eta_2(k)$, where $\eta_2(k) = e_k^T\eta_2$ and $e_k$ is the unit vector with the $k$th element one and all other elements zero. Then $\mathcal{H}_1$ and $\mathcal{H}_2$ admit the tensor sum decompositions $\mathcal{H}_{01} \oplus \mathcal{H}_{11}$ and $\mathcal{H}_{02} \oplus \mathcal{H}_{12}$, respectively, where $\mathcal{H}_{01} = \{A_1\eta_1 \mid \eta_1 \in \mathcal{H}_1\}$, $\mathcal{H}_{02} = \{A_2\eta_2 \mid \eta_2 \in \mathcal{H}_2\}$, $\mathcal{H}_{11} = \{(I - A_1)\eta_1 \mid \eta_1 \in \mathcal{H}_1\}$, $\mathcal{H}_{12} = \{(I - A_2)\eta_2 \mid \eta_2 \in \mathcal{H}_2\}$, and $I$ is the identity operator. Thus $\mathcal{H}$ has the tensor sum decomposition

$\mathcal{H} = (\mathcal{H}_{01} \otimes \mathcal{H}_{02}) \oplus (\mathcal{H}_{11} \otimes \mathcal{H}_{02}) \oplus (\mathcal{H}_{01} \otimes \mathcal{H}_{12}) \oplus (\mathcal{H}_{11} \otimes \mathcal{H}_{12}),$ (9)

and for any $f \in \mathcal{H}_1 \otimes \mathcal{H}_2$, we have

$f = f_{00} + f_{10} + f_{01} + f_{11},$ (10)

where $f_{00} = A_1A_2f \in \mathcal{H}_{01} \otimes \mathcal{H}_{02}$, $f_{10} = (I - A_1)A_2f \in \mathcal{H}_{11} \otimes \mathcal{H}_{02}$, $f_{01} = A_1(I - A_2)f \in \mathcal{H}_{01} \otimes \mathcal{H}_{12}$, and $f_{11} = (I - A_1)(I - A_2)f \in \mathcal{H}_{11} \otimes \mathcal{H}_{12}$. Thus, any function $f \in \mathcal{H}$ can be decomposed uniquely as: $f_{00}$, the intercept; $f_{10}$ and $f_{01}$, the marginal effects; and $f_{11}$, the two-way interaction term.

Denote the reproducing kernels of $\mathcal{H}_{01}$, $\mathcal{H}_{02}$, $\mathcal{H}_{11}$, $\mathcal{H}_{12}$ by $K_{01}$, $K_{02}$, $K_{11}$, $K_{12}$, respectively. Specifically, $K_{01}(x^{\langle1\rangle}, z^{\langle1\rangle}) = 1$, and $K_{11}(x^{\langle1\rangle}, z^{\langle1\rangle}) = (-1)^{m-1}k_{2m}(z^{\langle1\rangle} - x^{\langle1\rangle})$ for the $m$th order homogeneous subspace, where $k_r(\cdot)$ is the $r$th order scaled Bernoulli polynomial (Abramowitz and Stegun, 1964; Gu, 2013) and $\mathbf{1}(\cdot)$ is the indicator function. On $\mathcal{X}_2$, $K_{02}(x^{\langle2\rangle}, z^{\langle2\rangle}) = 1/2$ and $K_{12}(x^{\langle2\rangle}, z^{\langle2\rangle}) = \mathbf{1}(z^{\langle2\rangle} = x^{\langle2\rangle}) - 1/2$. Let $\mathcal{H}_{\ell\ell'} = \mathcal{H}_{\ell 1} \otimes \mathcal{H}_{\ell' 2}$ with reproducing kernel $K_{\ell\ell'}$, where

$K_{\ell\ell'}(x_{ij}, x_{i'j'}) = K_{\ell 1}(x_i^{\langle1\rangle}, x_{i'}^{\langle1\rangle})K_{\ell' 2}(x_j^{\langle2\rangle}, x_{j'}^{\langle2\rangle}),$

for ,′ ∈ {0, 1}. The induced inner product of Hll is denoted as 〈fℓℓ, gℓℓℓℓ, where fℓℓ and gℓℓ are projections of f and g on Hll respectively, ,′ ∈ {0, 1}. Notice that the metrics induced by inner products 〈fℓℓ, gℓℓℓℓ are not necessarily of the same scale for different ℓℓ. The inner product for H can be defined as

$\langle f, g\rangle_{\mathcal{H}} = \sum_{\ell\ell'}\theta_{\ell\ell'}^{-1}\langle f_{\ell\ell'}, g_{\ell\ell'}\rangle_{\ell\ell'},$ (11)

where the $\theta_{\ell\ell'}$'s rescale the metrics on the different $\mathcal{H}_{\ell\ell'}$'s, and $\langle\cdot,\cdot\rangle_{\ell\ell'}$ is the restriction of $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ to $\mathcal{H}_{\ell\ell'}$.

Based on Lemmas 2 and 3, we can easily show that the reproducing kernel associated with Equation (11) is $K(x_{ij}, x_{i'j'}) = \sum_{\ell,\ell'}\theta_{\ell\ell'}K_{\ell\ell'}(x_{ij}, x_{i'j'})$ with $\ell, \ell' = 0, 1$. Thus, given the sampling points $x_{ij} = (x_i^{\langle1\rangle}, x_j^{\langle2\rangle})$ for $i = 1, \ldots, n$ and $j = 1, \ldots, s$, the kernel function in $\mathcal{H}$ is a bivariate function depending on $x_{ij}$, i.e.,

$K_{x_{ij}}(x^{\langle1\rangle}, x^{\langle2\rangle}) = \frac{\theta_{00}}{2} + \theta_{01}\left(\mathbf{1}(x^{\langle2\rangle} = x_j^{\langle2\rangle}) - \frac{1}{2}\right) + \frac{\theta_{10}}{2}K_{11}(x_i^{\langle1\rangle}, x^{\langle1\rangle}) + \frac{\theta_{11}}{2}\left(\mathbf{1}(x^{\langle2\rangle} = x_j^{\langle2\rangle}) - \frac{1}{2}\right)K_{11}(x_i^{\langle1\rangle}, x^{\langle1\rangle}),$ (12)

and accordingly $f(x^{\langle1\rangle}, x^{\langle2\rangle}) = \sum_{ij}\alpha_{ij}K_{x_{ij}}(x^{\langle1\rangle}, x^{\langle2\rangle}) + \rho(x^{\langle1\rangle}, x^{\langle2\rangle})$ by Lemma 1.

In the function decomposition in Equation (10), it is easy to verify that $f_{00} \in \mathcal{H}_{00} = \{g : g = (\theta_{00}/2)\sum_{ij}\alpha_{ij}\}$. As $f_{00}$ is a constant for any $x^{\langle1\rangle}$ and $x^{\langle2\rangle}$, it is analogous to the grand mean in classical ANOVA models. Similarly, we have $f_{01} \in \mathcal{H}_{01} = \{g : g = \theta_{01}\sum_{ij}\alpha_{ij}\mathbf{1}(x^{\langle2\rangle} = x_j^{\langle2\rangle})\}$. Recalling that $x_j^{\langle2\rangle}$ can only be 0 or 1, we can rewrite $f_{01}$ as $\mathbf{1}(x^{\langle2\rangle} = 0)\beta_0 + \mathbf{1}(x^{\langle2\rangle} = 1)\beta_1$, where $\beta_0 = \sum_{j=1}^{s}(\sum_{i=1}^{n}\alpha_{ij})\mathbf{1}(x_j^{\langle2\rangle} = 0)$ and $\beta_1 = \sum_{j=1}^{s}(\sum_{i=1}^{n}\alpha_{ij})\mathbf{1}(x_j^{\langle2\rangle} = 1)$.

We remark that $f_{00}$ and $f_{01}$ both lie in finite-dimensional spaces. The space $\mathcal{H}_{10}$ (where $f_{10}$ lies), spanned by the third term on the right-hand side of Equation (12), is however an infinite-dimensional space, because there are uncountably many $x^{\langle1\rangle} \in \mathcal{X}_1$. The function can be expressed as a linear combination of the observed reproducing kernels plus a residual that quantifies the unobserved reproducing kernels, i.e., $\mathcal{H}_{10} = \{g : g = \frac{1}{2}\sum_{i=1}^{n}(\theta_{10}\sum_{j=1}^{s}\alpha_{ij})K_{11}(x_i^{\langle1\rangle}, x^{\langle1\rangle}) + \rho_{10}\}$. Notice that a function in this space changes only with $x^{\langle1\rangle}$. Thus, the third term on the right-hand side of (12) can be used to quantify the effect of the continuous variable, such as the temporal effect. The fourth term on the right-hand side of Equation (12) varies with both the continuous variable and the case-control indicator; it is therefore the term that can capture different functional patterns between case and control. Similarly, the space spanned by the last addend is also infinite-dimensional, because there are infinitely many unobserved kernel functions in addition to the $n \times s$ observed ones. Thus, we have $f_{11} \in \mathcal{H}_{11} = \{g : g = \frac{\theta_{11}}{2}\sum_{ij}\alpha_{ij}(\mathbf{1}(x_j^{\langle2\rangle} = x^{\langle2\rangle}) - \frac{1}{2})K_{11}(x_i^{\langle1\rangle}, x^{\langle1\rangle}) + \rho_{11}\}$. Clearly, to test whether the two groups of curves are parallel, we only need to test whether $f_{11} = 0$.
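As a concrete illustration of the kernels above, the following sketch constructs the marginal kernels $K_{11}$ and $K_{12}$ for $m = 2$, using the scaled Bernoulli polynomial $k_4(x) = B_4(x)/4!$; this is a minimal rendering of the formulas in this subsection, not the packaged implementation.

```python
import numpy as np

def k4(x):
    """Scaled Bernoulli polynomial k_4(x) = B_4(x)/4! on [0, 1)."""
    return (x**4 - 2 * x**3 + x**2 - 1.0 / 30.0) / 24.0

def K11_margin(x1, z1):
    """K_11 on X_1 for m = 2: (-1)^(m-1) k_{2m}(z - x) = -k_4({z - x})."""
    return -k4(np.mod(z1 - x1, 1.0))       # {.} denotes the fractional part

def K12_margin(x2, z2):
    """K_12 on X_2 = {0, 1}: 1(z = x) - 1/2."""
    return (np.asarray(x2) == np.asarray(z2)).astype(float) - 0.5

# By Lemma 2, the reproducing kernel of the interaction space is the
# product K11_margin * K12_margin.
x = np.linspace(0, 1, 5)
print(K11_margin(x[:, None], x[None, :]))
```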

2.3. Penalized Least Squares

Here we introduce the penalized least squares estimates of $f \in \mathcal{H}$ and of the interaction term $f_{11}$ in Equation (10). Given the sampling points $x_{ij} = (x_i^{\langle1\rangle}, x_j^{\langle2\rangle})$ for $i = 1, \ldots, n$ and $j = 1, \ldots, s$, consider the model space

$\mathcal{H}_{\text{model}} = \left\{g : g = \sum_{i=1}^{n}\sum_{j=1}^{s}\alpha_{ij}K_{x_{ij}}(x^{\langle1\rangle}, x^{\langle2\rangle})\right\},$

a closed linear subspace of $\mathcal{H}$. The $\alpha_{ij}$'s are the regression coefficients, and the bivariate residual function $\rho(\cdot,\cdot)$ in Lemma 1 lies in $\mathcal{H}_{\text{residual}} = \mathcal{H} \ominus \mathcal{H}_{\text{model}}$. Notice that $\rho(x_i^{\langle1\rangle}, x_j^{\langle2\rangle}) = \langle K_{x_{ij}}, \rho\rangle_{\mathcal{H}} = 0$ because of the orthogonality between $\mathcal{H}_{\text{model}}$ and $\mathcal{H}_{\text{residual}}$. Then $f$ can be estimated by minimizing the penalized least squares functional:

$\frac{1}{ns}\sum_{i=1}^{n}\sum_{j=1}^{s}\left(Y_{ij} - \sum_{i'j'}\alpha_{i'j'}K_{x_{i'j'}}(x_i^{\langle1\rangle}, x_j^{\langle2\rangle})\right)^2 + \lambda J(f_{10} + f_{11}),$ (13)

where the quadratic functional $J(f) = J(f_{10} + f_{11}) = \|f_{10} + f_{11}\|_{\mathcal{H}}^2$ quantifies the roughness of $f_{10}$ and $f_{11}$, and the smoothing parameter $\lambda$ controls the trade-off between the goodness of fit and the roughness of $f_{10}$ and $f_{11}$. Recall that $\rho$ and the $K_{x_{ij}}(\cdot,\cdot)$'s are orthogonal to each other. Plugging Equation (8) into $J(f)$, we have

$J(f) = \left\langle\sum_{ij}\alpha_{ij}(\theta_{10}K_{x_{ij}}^{10} + \theta_{11}K_{x_{ij}}^{11}),\; \sum_{ij}\alpha_{ij}(\theta_{10}K_{x_{ij}}^{10} + \theta_{11}K_{x_{ij}}^{11})\right\rangle_{\mathcal{H}} + \langle\rho, \rho\rangle_{\mathcal{H}}.$

Further notice that $\langle K_{x_{ij}}^{\ell\ell'}, K_{x_{i'j'}}^{\ell\ell'}\rangle = K_{x_{ij}}^{\ell\ell'}(x_{i'}^{\langle1\rangle}, x_{j'}^{\langle2\rangle})$ by the reproducing property of reproducing kernels (Gu, 2013). Thus, substituting $K$ and $K_{\ell\ell'}$ by (12) and $f$ in $J(f)$ by Equation (8), Equation (13) can be rewritten as

$\|y - nsK\alpha\|_2^2 + ns\lambda\,\alpha^TQ\alpha + ns\lambda\langle\rho, \rho\rangle_{\mathcal{H}},$ (14)

where $y = (Y_{11}, Y_{21}, \ldots, Y_{ns})^T$, $K$ is the $ns \times ns$ matrix with $(i + n(j-1), i' + n(j'-1))$th entry $\frac{1}{ns}K_{x_{ij}}(x_{i'}^{\langle1\rangle}, x_{j'}^{\langle2\rangle})$, $Q$ is the $ns \times ns$ matrix with $(i + n(j-1), i' + n(j'-1))$th entry $\frac{1}{ns}(\theta_{10}K_{x_{ij}}^{10}(x_{i'}^{\langle1\rangle}, x_{j'}^{\langle2\rangle}) + \theta_{11}K_{x_{ij}}^{11}(x_{i'}^{\langle1\rangle}, x_{j'}^{\langle2\rangle}))$, and $\alpha = (\alpha_{11}, \alpha_{21}, \ldots, \alpha_{ns})^T$. Similar to Chapter A3 in Gu (2013), we set the rescaling parameters $\theta_{10}$ and $\theta_{11}$ so that $\theta_{10}K_{10}$ and $\theta_{11}K_{11}$ contribute equally to the penalty term of Equation (14) (see Appendix A.1 for details), and we set $\theta_{00}$ and $\theta_{01}$ to one since $\mathcal{H}_{00}$ and $\mathcal{H}_{01}$ are simply one-dimensional Euclidean spaces. Since $\rho$ does not depend on $\alpha$, minimizing Equation (14) over $\alpha$ is equivalent to

$\hat\alpha = \arg\min_{\alpha \in \mathbb{R}^{ns}}\|y - nsK\alpha\|_2^2 + ns\lambda\,\alpha^TQ\alpha.$ (15)

The penalized least squares estimate of $f$ is then $\hat f(x_i^{\langle1\rangle}, x_j^{\langle2\rangle}) = \sum_{i',j'}^{n,s}\hat\alpha_{i'j'}K_{x_{i'j'}}(x_i^{\langle1\rangle}, x_j^{\langle2\rangle})$.
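The quadratic problem (15) has the closed-form solution $\hat\alpha = (nsK^2 + \lambda Q)^{-1}Ky$, obtained by setting the gradient to zero and using the symmetry of $K$. A minimal sketch, assuming $K$ and $Q$ have already been assembled as defined below Equation (14):

```python
import numpy as np

def fit_alpha(y, K, Q, lam, ns):
    """Minimize ||y - ns*K a||^2 + ns*lam * a'Q a  (Equation (15)).

    The first-order condition gives (ns*K@K + lam*Q) a = K @ y.
    """
    return np.linalg.solve(ns * (K @ K) + lam * Q, K @ y)

# hypothetical usage, with K and Q precomputed (see Appendix A.1):
# alpha_hat = fit_alpha(y, K, Q, lam=1e-3, ns=n * s)
# f_hat = ns * K @ alpha_hat        # fitted values at the design points
```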

As $n$ goes to infinity, we have a countable number of kernels, and the minimizer of Equation (13) resides in an infinite-dimensional space spanned by a countable number of kernels, i.e.,

$\mathcal{H}_{\text{model}} = \left\{g : g(x^{\langle1\rangle}, x^{\langle2\rangle}) = \sum_{ij}\alpha_{ij}K_{x_{ij}}(x^{\langle1\rangle}, x^{\langle2\rangle})\right\}.$

The nonparallel effect f11 also resides in a subspace that is spanned by a countable number of kernels. We denote the subspace by

$\mathcal{H}_{11} = \left\{f_{11} : f_{11}(x^{\langle1\rangle}, x^{\langle2\rangle}) = \sum_{ij}\alpha_{ij}\frac{(-1)^{m-1}}{2}\left(\mathbf{1}(x_j^{\langle2\rangle} = x^{\langle2\rangle}) - \frac{1}{2}\right)k_{2m}(x_i^{\langle1\rangle} - x^{\langle1\rangle})\right\}.$

Here, we do not normalize $f_{11}$ by the constant scale parameter $\theta_{11}$, for simplicity of description. The penalized least squares estimate of $f_{11} \in \mathcal{H}_{11}$ is

$\hat f_{11}(x^{\langle1\rangle}, x^{\langle2\rangle}) = \sum_{i,j}^{n,s}\hat\alpha_{ij}\frac{(-1)^{m-1}}{2}\left(\mathbf{1}(x_j^{\langle2\rangle} = x^{\langle2\rangle}) - \frac{1}{2}\right)k_{2m}(x_i^{\langle1\rangle} - x^{\langle1\rangle}).$ (16)

With a slight abuse of notation, we use $\hat f_{11}$ to denote the vector of evaluations of $\hat f_{11}$ at the $ns$ data points from now on. Plugging $\hat\alpha$ into (16), we obtain an explicit expression for $\hat f_{11}$:

$\hat f_{11} = K^{11}M^{-1}\left(I_{ns} - S(S^TM^{-1}S)^{-1}S^TM^{-1}\right)y,$ (17)

where $I_{ns}$ is the $ns$-dimensional identity matrix, and $S$, $M$, and $K^{11}$ are reparametrizations of the kernel matrices with explicit forms provided in Appendix A.1 ("Notation Clarification").

In Section 4, we construct a Wald-type test statistic based on $\hat f_{11}$ for the parallelism test $H_0: f_{11} = 0$ and derive its asymptotic null distribution. Before that, we first establish the minimax principle of the parallelism test for general testing rules in Section 3.

3. Minimax Principle of the Nonparametric Parallelism Test

Consider the following testing problem:

$H_0: f_{11} = 0 \quad \text{vs.} \quad H_1: \|f_{11}\|_2 > 0.$ (18)

Given a decision rule $\phi_n$ for the testing problem (18), $\phi_n = 0$ if $H_0$ is preferred and $\phi_n = 1$ otherwise. The zero-one loss function is

$\text{Loss}(\phi_n) = \begin{cases}\phi_n & \text{if } H_0 \text{ is true},\\ 1 - \phi_n & \text{if } H_1 \text{ is true}.\end{cases}$ (19)

The minimax principle requires ϕn to minimize the maximum possible risk, i.e.,

$\min_{\phi_n}\max_{H}E[\text{Loss}(\phi_n)] = \min_{\phi_n}\left[\max_{H_0}E(\phi_n \mid H_0 \text{ is true}) + \max_{H_1}E(1 - \phi_n \mid H_1 \text{ is true})\right].$ (20)

Notice that $E(\phi_n \mid H_0 \text{ is true})$ is the probability of a type I error and $E(1 - \phi_n \mid H_1 \text{ is true})$ is the probability of a type II error. Intuitively, we choose $\phi_n$ to minimize the maximum possible type I and type II errors. Notice that if $H_0$ and $H_1$ are contiguous, we cannot ensure that Equation (20) can be controlled, because some $f_{11}$ may lie on the boundary between $H_0$ and $H_1$ that strikes a balance between acceptance and rejection of the null hypothesis, so that no appropriate decision can be made. Thus, instead of $H_1$, we consider the slightly different alternative hypothesis (4) and partition the parameter space into three sets, $H_0 + H_1^* + I$, where $I$ designates the indifference zone $0 < \|f_{11}\|_2 < d_n$. Because $d_n$ clearly separates $H_0$ from $H_1^*$, it is referred to as the distinguishable rate (a.k.a. the separation rate) (Ingster and Suslina, 2012; Giné and Nickl, 2015). Let

$\text{pseudo.risk}(\phi_n, d_n) = \sup_{H_0}E(\phi_n \mid H_0 \text{ is true}) + \sup_{H_1^*}E(1 - \phi_n \mid H_1^* \text{ is true}).$ (21)

Then pseudo.risk(ϕn, dn) converges to the risk function E[Loss(ϕn)] as dn goes to zero.

Compared with the risk function, the pseudo.risk is a function not only of the decision rule $\phi_n$ but also of the distinguishable rate $d_n$. When $\phi_n$ is given, we have $\sup_{H_1^*}E(1 - \phi_n \mid H_1^* \text{ is true}) \leq \sup_{H_1}E(1 - \phi_n \mid H_1 \text{ is true})$ because $H_1^*$ is a subset of $H_1$. Thus, finding the largest pseudo.risk on $H_1^*$ for a given $\phi_n$ is equivalent to finding the smallest $d_n$ with a tolerable pseudo.risk. In other words, finding the maximum possible pseudo.risk over the parameter space can be viewed as finding the smallest boundary of $H_1^*$ such that an appropriate decision $\phi_n$ can be made and the risk can be controlled. Meanwhile, for an adequately large $d_n$, we can always find a decision rule such that the pseudo.risk reaches its minimum value. Let $\phi_n^*(d_n) = \arg\min_{\phi_n}\text{pseudo.risk}(\phi_n, d_n)$. Then, if $d_n$ reaches its smallest value $d_n^*$, the corresponding $\phi_n^*(d_n^*)$ is the minimax decision. Thus, the essential step in finding the minimax decision is to find $d_n^*$ such that

$d_n^* = \arg\min_{d_n}\,\text{pseudo.risk}(\phi_n^*(d_n), d_n).$ (22)

Because $d_n^*$ is an estimate of the distinguishable rate that yields the minimax test, it is referred to as the minimax distinguishable rate. Clearly, the corresponding decision rule $\phi_n^* = \phi_n^*(d_n^*)$ is the minimax decision rule.

We first introduce a geometric interpretation of the testing problem (18). Geometrically, we can treat $\mathcal{E} = \{f \in \mathcal{H} : \|f\|_{\mathcal{H}} < 1/2\}$ as an ellipsoid whose axis lengths are the eigenvalues in Equation (7), as shown in Figure 2. For any $f \in \mathcal{E}$, the projection of $f$ onto $\mathcal{E}_{11} := \mathcal{H}_{11} \cap \mathcal{E}$ is $f_{11}$. The magnitude of nonparallelism can be quantified by $\|f_{11}\|_2$. The distinguishable rate $d_n$ is the radius of the sphere centered at $f_{11} = 0$ in $\mathcal{H}_{11}$.

Figure 2: Geometric interpretation of the distinguishable rate of the parallelism test.

Intuitively, the testing problem becomes harder when the projection of $f$ onto $\mathcal{E}_{11}$ is closer to the origin $f_{11} = 0$. We use the Bernstein width of Pinkus (2012) to characterize the testing difficulty. Let $S_{k+1}$ be the set of all $(k+1)$-dimensional subspaces for any $k \geq 1$. For a compact set $C$, the Bernstein $k$-width is defined as

$b_{k,2}(C) \equiv \arg\max_{r \geq 0}\left\{B_2^{k+1}(r) \subseteq C \cap S \text{ for some subspace } S \in S_{k+1}\right\},$ (23)

where $B_2^{k+1}(r)$ is a $(k+1)$-dimensional $\ell_2$-ball with radius $r$ centered at $f_{11} = 0$ in $\mathcal{E}_{11}$. The Bernstein width characterizes the largest ball that can be inscribed in a $(k+1)$-dimensional subspace of $\mathcal{E}_{11}$. Based on the Bernstein width, we give an upper bound on the testing radius: for any $f$ whose projection lies in the ball with radius less than this upper bound, the minimum pseudo.risk is larger than $1/2$.

Lemma 4 For any $f \in \mathcal{H}$, we have

$\inf_{\phi_n}\text{pseudo.risk}(\phi_n, d_n) \geq 1/2$

for all

$d_n \leq r_B \equiv \sup\left\{\delta \,\middle|\, \delta \leq \frac{\sigma}{\sqrt{2n}}(k_B(\delta))^{1/4}\right\},$

where $k_B(\delta) \equiv \arg\max_k\{b_{k-1,2}^2(\mathcal{E}_{11}) \geq \delta^2\}$ is the Bernstein lower critical dimension, and $r_B$ is called the Bernstein lower critical radius.

Lemma 4 shows that when $d_n$ is less than $r_B$, no test can distinguish the alternative hypothesis from the null. To achieve nontrivial power, we need $d_n$ to be larger than the Bernstein lower critical radius $r_B$, which is determined by the Bernstein lower critical dimension $k_B(\delta)$. In the next lemma, we provide a lower bound for $k_B(\delta)$.

Lemma 5 Let $\{\rho_i\}_{i=1}^{\infty}$ be the eigenvalues of $\mathcal{H}_{11}$. We have

$k_B(\delta) > \arg\max_k\{\rho_k \geq \delta\}.$ (24)

Plugging the lower bound on $k_B(\delta)$ derived in Lemma 5 into Lemma 4, we can calculate a lower bound for $r_B$ based on the decay rate of the eigenvalues; $r_B$ serves as a minimax lower bound for the distinguishable rate. The following theorem summarizes the minimax distinguishable rate for the testing problem (18).

Theorem 6 (Minimax lower bound for the distinguishable rate) Consider the nonparametric model (1) with the SSANOVA decomposition (2). Suppose $f \in \mathcal{H}$, where $\mathcal{H} = \mathcal{H}_1 \otimes \mathcal{H}_2$ with $\mathcal{H}_1$ the $m$th order Sobolev space and $\mathcal{H}_2$ a two-dimensional Euclidean space. The minimax distinguishable rate for testing the hypotheses (18) is achieved at $d_n \asymp n^{-2m/(4m+1)}$.

Theorem 6 provides general guidance for justifying a local minimax test: no test can distinguish the alternative from the null if $d_n \ll n^{-2m/(4m+1)}$. The proof of Theorem 6 is presented in the Appendix. Essentially, for any test $\phi_n$ with type I error $\alpha = E(\phi_n \mid H_0 \text{ is true})$ and supremum of the type II error $\delta = \sup_{H_1^*}E(1 - \phi_n \mid H_1^* \text{ is true})$, the sum $\alpha + \delta$ cannot be made small unless $d_n$ is at least of the order $n^{-2m/(4m+1)}$. For example, with $m = 2$ (cubic splines), the minimax distinguishable rate is $n^{-4/9}$. We further remark that the minimax rate for nonparametric estimation is $n^{-m/(2m+1)}$ (Yang et al., 2017), which is of larger order than the minimax distinguishable rate $n^{-2m/(4m+1)}$. In the next section, we introduce a Wald-type test for the hypothesis testing problem (18) whose separation rate $d_n$ achieves the lower bound $n^{-2m/(4m+1)}$, indicating that our proposed test is minimax optimal.

4. Wald-Type Parallelism Test

In this section, we propose a Wald-type test statistic based on the penalized least squares estimate of $f_{11}$ and derive the asymptotic distribution of the test statistic. We further prove an upper bound on the distinguishable rate of the Wald-type test that matches the minimax lower bound established in Theorem 6.

4.1. Wald-Type Test and Asymptotic Distribution

The nonparallel effect between the curves of the case group and the control group is measured by the magnitude $\|f_{11}\|_2^2$. The nonparallelism test in Equation (3) is equivalent to

$H_0: f \in \mathcal{H}_{\text{model}} \ominus \mathcal{H}_{11} \quad \text{vs.} \quad H_1: f \in \mathcal{H}_{\text{model}},$

or equivalently, $H_1: f_{11} \in \mathcal{H}_{11} \setminus \{0\}$. First, notice that the null hypothesis in Equation (18) is composite, as it defines a class of functions in $\mathcal{H}_{\text{model}} \ominus \mathcal{H}_{11}$. Second, since $H_0$ defines an infinite-dimensional parameter space as $n \to \infty$, the assumptions of the Neyman-Pearson lemma cannot be satisfied, so a uniformly most powerful test may not exist in general. To overcome this difficulty, we propose the Wald-type test statistic

$T_{n,\lambda} = \frac{1}{ns}\|\hat f_{11}\|_2^2$ (25)

and show its minimax optimality.

Since $Y_{ij}$ follows Equation (1) with $f$ satisfying the SSANOVA decomposition in Equation (2), we can replace each element of the vector $y$ by $f_{00}(x_i^{\langle1\rangle}, x_j^{\langle2\rangle}) + f_{10}(x_i^{\langle1\rangle}, x_j^{\langle2\rangle}) + f_{01}(x_i^{\langle1\rangle}, x_j^{\langle2\rangle}) + f_{11}(x_i^{\langle1\rangle}, x_j^{\langle2\rangle}) + \epsilon_{ij}$. Then, plugging the expression for $\hat f_{11}$ in Equation (17) into $T_{n,\lambda}$, we have

$T_{n,\lambda} = \frac{1}{ns}\left\|K^{11}M^{-1}\left(I_{ns} - S(S^TM^{-1}S)^{-1}S^TM^{-1}\right)(f_{00} + f_{10} + f_{01} + f_{11} + \epsilon)\right\|_2^2,$

where $f_{00}$, $f_{10}$, $f_{01}$ and $f_{11}$ are $ns$-dimensional vectors with $ij$th entries $f_{00}(x_i^{\langle1\rangle}, x_j^{\langle2\rangle})$, $f_{10}(x_i^{\langle1\rangle}, x_j^{\langle2\rangle})$, $f_{01}(x_i^{\langle1\rangle}, x_j^{\langle2\rangle})$ and $f_{11}(x_i^{\langle1\rangle}, x_j^{\langle2\rangle})$, respectively, and $\epsilon$ is the $ns$-dimensional stochastic error following a normal distribution with mean $0$ and covariance $\sigma^2I_{ns}$. Because $f_{00}$, $f_{10}$ and $f_{01}$ lie in the space orthogonal to the space spanned by $K^{11}$, and $f_{11} = 0$ under the null hypothesis, $T_{n,\lambda}$ can be further simplified as

$T_{n,\lambda} = \frac{1}{ns}\left\|K^{11}M^{-1}\left(I_{ns} - S(S^TM^{-1}S)^{-1}S^TM^{-1}\right)\epsilon\right\|_2^2.$ (26)

A detailed discussion of this simplification is provided in Lemma 12 in the Appendix.

Next, we develop the null limiting distribution of $T_{n,\lambda}$ as $n$ goes to infinity. In the derivation, we only require the number of subjects $s$ to be finite. This requirement is desirable in real applications since the number of subjects in an experiment is usually limited. For example, due to the high sequencing cost, usually only tens of samples are sequenced in DNA methylation studies.

We consider the following two designs.

Quasi-Uniform Design: $x_1^{\langle1\rangle}, x_2^{\langle1\rangle}, \ldots, x_n^{\langle1\rangle} \overset{iid}{\sim} \omega(x^{\langle1\rangle})$, where $\omega$ is the marginal density of $x^{\langle1\rangle}$. For any $x^{\langle1\rangle} \in [0, 1]$, there exist two constants $c_1, c_2 > 0$ such that $c_1 \leq \omega(x^{\langle1\rangle}) \leq c_2$ (Eggermont and LaRiccia, 2001).

Uniform Design: $x_1^{\langle1\rangle}, x_2^{\langle1\rangle}, \ldots, x_n^{\langle1\rangle}$ are evenly spaced on $[0, 1]$.

The above two designs are commonly used in scientific investigations. For example, in fMRI experiments, the sampling points in the time domain are usually measured at equal time intervals and are thus assumed to follow the uniform design. On the other hand, DNA methylation sites are randomly scattered along the DNA sequence and are therefore assumed to follow a quasi-uniform design.

Theorem 7 For both the uniform design and the quasi-uniform design, if the smoothing parameter $\lambda = O(n^{c-1})$ for any fixed $c \in (0, 1)$, we have

$\frac{T_{n,\lambda} - \mu_{n,\lambda}}{\sigma_{n,\lambda}} \overset{d}{\to} N(0, 1) \quad \text{as } n \to \infty,$

where $\mu_{n,\lambda} = \sigma^2\,\mathrm{Tr}(\Delta)/(ns)$ and $\sigma_{n,\lambda}^2 = 2\sigma^4\,\mathrm{Tr}(\Delta^2)/(ns)^2$ with $\Delta = M^{-1}(K^{11})^2M^{-1}$.

In practice, we estimate the variance $\sigma^2$ by

$\hat\sigma^2 = \frac{y^T(I - A(\lambda))^2y}{\mathrm{Tr}(I - A(\lambda))},$

where $A(\lambda) = nsK(nsK^2 + \lambda Q)^{-1}K$ is the smoothing matrix from the objective function in Equation (15), so that $(I - A(\lambda))y$ is the residual $y - \hat f$. The consistency of the variance estimate $\hat\sigma^2$ is established in Theorem 3.4 of Gu (2013).

The proof of Theorem 7 is provided in the Appendix and sketched below. Notice that $T_{n,\lambda} = T_1 + T_2 - 2T_3$, where

$T_1 = \frac{1}{ns}\epsilon^TM^{-1}(K^{11})^2M^{-1}\epsilon, \quad T_2 = \frac{1}{ns}\left\|K^{11}M^{-1}S(S^TM^{-1}S)^{-1}S^TM^{-1}\epsilon\right\|_2^2, \quad T_3 = \frac{1}{ns}\epsilon^TM^{-1}S(S^TM^{-1}S)^{-1}S^TM^{-1}(K^{11})^2M^{-1}\epsilon.$ (27)

We show that $T_2$ and $T_3$ are higher-order perturbation terms compared with $T_1$. Thus, the null distribution of $T_{n,\lambda}$ and the distribution of $T_1$ are asymptotically equivalent, and we only need to focus on the distribution of the quadratic form $T_1 = \frac{1}{ns}\epsilon^T\Delta\epsilon$ with $\epsilon$ having a mean-zero normal distribution. To prove the normality of $T_1$, we show that the log-characteristic function of the standardized $T_1$ is asymptotically $-\sigma^2t^2/2$, provided that $\mathrm{Tr}(\Delta^2)$ diverges as $\lambda \to 0$. Lemma 15 shows that $\mathrm{Tr}(\Delta^2) \asymp \hat\tau_\lambda$, where $\hat\tau_\lambda = \max\{i \mid \hat\mu_i \geq \lambda\}$ is the effective dimension (Bartlett et al., 2005; Liu et al., 2019), with $\hat\mu_1 \geq \cdots \geq \hat\mu_n$ the empirical eigenvalues of the kernel matrix of $\mathcal{H}_{11}$, whose $(i, i')$th entry is $\frac{1}{n}K_{11}(x_i^{\langle1\rangle}, x_{i'}^{\langle1\rangle})$. We further show in Lemmas 13 and 14 that $\hat\tau_\lambda$ is of the same order as its population counterpart $\tau_\lambda = \max\{i \mid \mu_i \geq \lambda\}$ under both the quasi-uniform design and the uniform design, where $\mu_1 \geq \cdots \geq 0$ is the sequence of ordered eigenvalues satisfying $K_{11}(x, x') = \sum_{i=1}^{\infty}\mu_i\phi_i(x)\phi_i(x')$. Since $\mu_i$ has a polynomial decay rate $i^{-2m}$ (Gu, 2013), we have $\mathrm{Tr}(\Delta^2) \asymp \hat\tau_\lambda \asymp \tau_\lambda \asymp \lambda^{-1/(2m)}$, which diverges as $\lambda \to 0$. Consequently, the testing consistency in Theorem 7 holds.

Theorem 7 characterizes the distribution of the test statistic $T_{n,\lambda}$ for $f \in \mathcal{H}_{\text{model}} \ominus \mathcal{H}_{11}$. The distribution turns out to be fairly simple and easy to calculate, as the test statistic does not depend on any unknown nuisance functions such as $f_{00}$, $f_{10}$ and $f_{01}$. The critical value can easily be found from the known null distribution $N(\mu_{n,\lambda}, \sigma_{n,\lambda}^2)$. Consequently, one can make a statistical decision by comparing $T_{n,\lambda}$ with the critical value. This nuisance-parameter-free property is referred to as the "Wilks phenomenon" in the statistics literature (Fan et al., 2001; Fan and Zhang, 2004).
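For concreteness, the decision rule implied by Theorem 7 can be sketched as below. The matrices $K^{11}$, $M$ and $S$ are taken as given (their explicit forms are in Appendix A.1), so this is a hedged illustration rather than the packaged implementation.

```python
import numpy as np
from scipy.stats import norm

def wald_parallelism_test(y, K11, M, S, sigma2, ns):
    """Wald-type test of H0: f11 = 0, calibrated by the null law in Theorem 7."""
    Minv = np.linalg.inv(M)
    P = np.eye(ns) - S @ np.linalg.solve(S.T @ Minv @ S, S.T @ Minv)
    f11_hat = K11 @ Minv @ P @ y                      # Equation (17)
    T = np.sum(f11_hat ** 2) / ns                     # Equation (25)

    Delta = Minv @ K11 @ K11 @ Minv                   # Delta = M^{-1}(K^11)^2 M^{-1}
    mu = sigma2 * np.trace(Delta) / ns                # mu_{n,lambda}
    sd = np.sqrt(2.0) * sigma2 * np.sqrt(np.trace(Delta @ Delta)) / ns  # sigma_{n,lambda}
    z = (T - mu) / sd
    return T, z, 2 * norm.sf(abs(z))                  # two-sided p-value
```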

4.2. Upper Bound of the Distinguishable Rate

Given a type I error level $\alpha$, we show that our Wald-type testing rule $\phi_{n,\lambda} = \mathbf{1}(|T_{n,\lambda} - \mu_{n,\lambda}| \geq z_{\alpha/2}\sigma_{n,\lambda})$ achieves the local minimax distinguishable rate. Without loss of generality, we assume $\|f\|_{\mathcal{H}} \leq 1$.

Theorem 8 Let the minimum distinguishable rate of the test $\phi_{n,\lambda}$ be $d_n(\phi_{n,\lambda})$. Suppose $\lambda = O(n^{c-1})$ for any fixed $c \in (0, 1)$. Then for any $\delta > 0$, there exist positive constants $C_\delta$ and $N_\delta$ such that, when $n \geq N_\delta$, the tolerable $\text{pseudo.risk}(\phi_{n,\lambda}, d_n) \leq \alpha + \delta$ with $d_n(\phi_{n,\lambda}) \leq C_\delta\sqrt{\lambda + \sigma_{n,\lambda}}$.

Theorem 8 shows that, for a controlled type I error, $T_{n,\lambda}$ can achieve an arbitrarily small type II error provided that the local alternative is separated from the null by at least $d_n(\phi_{n,\lambda})$. The proof of Theorem 8 is collected in the Appendix.

Note that $d_n^2(\phi_{n,\lambda})$ consists of two components: $\sigma_{n,\lambda}$, representing the standard deviation of the test statistic $T_{n,\lambda}$, and $\lambda$, representing the squared bias of $\hat f_{11}$ (see the proof of Lemma S.1 in the Supplementary Material). By approximating $\sigma_{n,\lambda}$ via the Rademacher complexity (Bartlett et al., 2005; Liu et al., 2019), we show that $\sigma_{n,\lambda} \asymp \sqrt{\tau_\lambda}/n$, which is a decreasing function of $\lambda$. Hence, the minimum distinguishable rate of $\phi_{n,\lambda}$ is achieved by trading off the bias of $\hat f_{11}$ against the standard deviation of $T_{n,\lambda}$, i.e., choosing $\lambda$ such that $\lambda \asymp \sigma_{n,\lambda}$. We next prove that our proposed Wald-type test is minimax under two special designs, the quasi-uniform design and the uniform design, in the following two corollaries.

Corollary 9 (Quasi-Uniform Design) Let $\lambda \asymp n^{-4m/(4m+1)}$ and suppose $x^{\langle1\rangle}$ follows the quasi-uniform design. We have

$P\left(d_n(\phi_{n,\lambda}) \asymp n^{-2m/(4m+1)}\right) \geq 1 - 4\exp\left(-n^{1/(2m+1)}\right).$

Corollary 10 (Uniform Design) Let $\lambda \asymp n^{-4m/(4m+1)}$ and suppose $x^{\langle1\rangle}$ follows the uniform design. We have

$d_n(\phi_{n,\lambda}) \asymp n^{-2m/(4m+1)} \quad \text{a.s.}$

Corollaries 9 and 10 suggest that if $\lambda \asymp n^{-4m/(4m+1)}$, our Wald-type test $\phi_{n,\lambda}$ achieves the minimax distinguishable rate $d_n \asymp n^{-2m/(4m+1)}$. Thus, our proposed Wald-type test is minimax optimal. We remark that Corollary 9 still holds when $\mathcal{H}_1$ is extended to a standard Sobolev space.

4.3. The Choice of Regularization Parameter

Unlike the classical bias-variance tradeoff in optimal nonparametric estimation, Theorem 8 states that optimal nonparametric testing for Equation (3) is achieved by a different tradeoff, between the squared bias of the estimator and the standard deviation of the test statistic. This intrinsic difference further leads to different orders of the optimal regularization parameter: as shown in Corollaries 9 and 10, the optimal $\lambda$ for testing is of order $n^{-4m/(4m+1)}$, whereas the optimal $\lambda$ for estimation is of order $n^{-2m/(2m+1)}$ (Gu, 2013).

In practice, cross-validation is often used as a tuning procedure for nonparametric estimation based on penalized loss functions (Golub et al., 1979). Raskutti et al. (2014) proposed another data-dependent algorithmic regularization technique, namely choosing an early stopping rule for an iterative algorithm to avoid over-fitting in nonparametric estimation. Both of these approaches are optimal for estimation but suboptimal for testing. There are few theoretically justified tuning procedures for obtaining optimal testing in nonparametric inference. One related work we are aware of is Liu and Cheng (2018), who developed a data-dependent early stopping regularization rule from an algorithmic perspective for testing $f = 0$ in the nonparametric regression model $Y = f(X) + \epsilon$. The total number of steps determined by the early stopping rule in the gradient descent algorithm plays the same role as $1/\lambda$ in penalized regularization in avoiding over-fitting. However, a data-adaptive choice of the regularization parameter $\lambda$ has been lacking for the nonparametric inference problem in Equation (3) under penalized regularization.

We propose a data-adaptive method for choosing $\lambda$ with a testing optimality guarantee based on Theorem 8. In practice, we choose the optimal smoothing parameter $\lambda^*$ satisfying

$\lambda^* = \min\{\lambda \mid \lambda < \sigma_{n,\lambda}\},$ (28)

where $\sigma_{n,\lambda}$ can be explicitly calculated from the observed data via the expression in Theorem 7, i.e., $\sigma_{n,\lambda}^2 = 2\sigma^4\,\mathrm{Tr}(\Delta^2)/(ns)^2$ with $\Delta = M^{-1}(K^{11})^2M^{-1}$.

The criterion in Equation (28) for choosing $\lambda$ is a data-dependent rule that produces a minimax-optimal nonparametric testing method. Based on the Rademacher complexity, $\sigma_{n,\lambda} \asymp \frac{\sigma^2}{ns}\sqrt{\sum_{i=1}^{n}\min\{1, \hat\mu_i/\lambda\}}$. That is, the rule in Equation (28) depends on the eigenvalues of the kernel matrix, especially the first few leading eigenvalues. There are many efficient methods for computing the top eigenvalues quickly (Drineas and Mahoney, 2005; Ma and Belkin, 2017). As future work, one could also introduce randomly projected kernel methods to accelerate the computation.
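A grid-search sketch of this rule, using the Rademacher-complexity approximation of $\sigma_{n,\lambda}$ above and returning the crossing point where $\lambda$ first matches $\sigma_{n,\lambda}$ (cf. the trade-off $\lambda \asymp \sigma_{n,\lambda}$ in Theorem 8); the eigenvalues and grid are assumed precomputed.

```python
import numpy as np

def lambda_star(mu_hat, sigma2, ns, grid):
    """Data-adaptive smoothing parameter in the spirit of rule (28).

    mu_hat : empirical eigenvalues of the kernel matrix of H_11 (descending);
    sigma_{n,lambda} is approximated by (sigma^2/ns)*sqrt(sum_i min(1, mu_i/lambda)).
    """
    for lam in np.sort(grid):                          # ascending search
        sig = sigma2 / ns * np.sqrt(np.sum(np.minimum(1.0, mu_hat / lam)))
        if lam >= sig:                                 # crossing: lambda ~ sigma_{n,lambda}
            return lam
    return grid.max()

# hypothetical usage:
# mu_hat = np.sort(np.linalg.eigvalsh(K11_matrix / n))[::-1]
# lam_opt = lambda_star(mu_hat, sigma2=1.0, ns=n * s, grid=np.logspace(-8, 0, 60))
```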

5. Simulation Study

To assess the performance of our proposed test, we carried out extensive analyses on simulated data sets. We compared our approach with the F-test (SSF) (Ma et al., 2009), the parallelism trend test (PTT) (Degras et al., 2011), and a random permutation test with 500 permutations. Among the three competing methods, the permutation test can be used as a benchmark because it closely approximates the null distribution when the number of permutations is adequate. However, the permutation test is computationally intensive, especially for calculating the Kullback-Leibler distance under the null and alternative hypotheses for the SSANOVA model (Gu, 2004).

5.1. Empirical Power Analysis

We illustrate the empirical power performance of our proposed test through four designed examples. In all four examples, we generated 100 to 1000 observations, in increments of 100, in each simulation for both the case and control groups in Equation (1), where $x_i^{\langle1\rangle} \overset{iid}{\sim} U(0, 1)$ and $\epsilon_{ij} \overset{iid}{\sim} N(0, 1)$. Each example was repeated 500 times for power and other comparisons. To make the simulations closer to reality, we considered two types of nonparallel patterns between $f(x^{\langle1\rangle}, 1)$ and $f(x^{\langle1\rangle}, 0)$: magnitude and frequency. These two kinds of nonparallel patterns are often observed in real applications. For example, hypermethylated DNA regions, i.e., regions with high methylation levels, are related to transcriptional silencing, which plays an important role in cancer development; frequency differences are often related to different brain functions between the neurological disease and control groups in fMRI studies. In the first four examples, we consider the following function in Equation (1),

$f(x^{\langle1\rangle}, x^{\langle2\rangle}) = \begin{cases}2.5\sin(3\pi x^{\langle1\rangle})(1 - x^{\langle1\rangle}) & \text{if } x^{\langle2\rangle} = 0, \text{ i.e., control},\\ (2.5 + \delta_1)\sin((3 + \delta_2)\pi x^{\langle1\rangle})(1 - x^{\langle1\rangle})^{(1 + \delta_3)} & \text{if } x^{\langle2\rangle} = 1, \text{ i.e., case},\end{cases}$ (29)

where $\delta_1$, $\delta_2$ and $\delta_3$ control the magnitude of nonparallelism between the null hypothesis and the alternative hypothesis in Equation (18). In general, varying $\delta_1$, $\delta_2$ and $\delta_3$ gives rise to different distinguishable rates $d_n$: the larger the $\delta$'s are, the larger $d_n$ is. To illustrate how the testing power is affected by the different $\delta$'s, as shown in Figure 3, we considered the following four settings. Setting 1: case and control have constant magnitude differences ($\delta_1 = 0.50, 0.75, 1.00$ and $\delta_2 = \delta_3 = 0.00$); Setting 2: case and control have frequency differences ($\delta_2 = 0.20, 0.30, 0.40$ and $\delta_1 = \delta_3 = 0.00$); Setting 3: both magnitude and frequency differ ($(\delta_1, \delta_2) = (0.50, 0.20), (0.75, 0.30), (1.00, 0.40)$ and $\delta_3 = 0.00$); Setting 4: case and control have non-constant magnitude differences ($\delta_1 = \delta_2 = 0.00$ and $\delta_3 = 0.50, 0.75, 1.00$). The corresponding functions $f(x^{\langle1\rangle}, 0)$ and $f(x^{\langle1\rangle}, 1)$ are shown in Figure 3; a data-generating sketch follows below.
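A sketch of the data-generating process of Equation (29). Note that the exponent $(1 + \delta_3)$ on $(1 - x^{\langle1\rangle})$ is our reading of the non-constant magnitude difference in Setting 4 and should be treated as an assumption.

```python
import numpy as np

def f_eq29(x1, group, d1=0.0, d2=0.0, d3=0.0):
    """Case/control trends of Equation (29); deltas control the nonparallelism."""
    if group == 0:                                        # control
        return 2.5 * np.sin(3 * np.pi * x1) * (1 - x1)
    # exponent (1 + d3) is an assumed reconstruction of the garbled source
    return (2.5 + d1) * np.sin((3 + d2) * np.pi * x1) * (1 - x1) ** (1 + d3)

rng = np.random.default_rng(1)
n = 500
x1 = rng.uniform(0, 1, n)                                 # x ~ U(0, 1), as in Section 5.1
y_ctrl = f_eq29(x1, 0) + rng.normal(size=n)               # N(0, 1) noise
y_case = f_eq29(x1, 1, d1=0.75) + rng.normal(size=n)      # Setting 1 with delta1 = 0.75
```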

Figure 3: Functions of the control group (solid line) and case group (dashed, dotted and dot-dash lines) with four types of nonparallel patterns: magnitude differences only (Setting 1), frequency differences only (Setting 2), both magnitude and frequency differences (Setting 3), and dynamic magnitude differences (Setting 4).

The empirical powers of our proposed Wald-type test, the permutation test, the SSF test and the PTT test are summarized in Tables 1-2 for Settings 1-2. For Setting 1, as shown in Table 1, the empirical power of our test increases rapidly with the sample size and approaches 1 even for the smallest magnitude ($\delta_1 = 0.50$). The empirical powers of the proposed test are comparable with those of the permutation test. In contrast, the empirical powers of SSF and PTT increase more slowly than that of our proposed test. In the weak signal scenario, i.e., $\delta_1 = 0.50$, the proposed test shows a significant gain in power across sample sizes. In the strong signal scenario, i.e., $\delta_1 = 1.00$, our proposed test is significantly more powerful than SSF and PTT when the sample size is less than 500. For Setting 2, as shown in Table 2, the empirical power of our proposed test converges to 1 as the sample size increases in all three cases $\delta_2 = 0.20, 0.30, 0.40$. In contrast, the empirical powers of SSF and PTT converge to 1 more slowly than that of the proposed test.

Table 1:

Table lists the empirical power of our proposed test, the permutation test, SSF, and PTT for Setting 1 with δ1 = 0.50, 0.75, 1.00, δ2 = δ3 = 0.00 and sample sizes ranging from 100 to 1000.

Sample Size
100 200 300 400 500 600 700 800 900 1000
δ1 = 0.50 Proposed 0.17 0.33 0.49 0.59 0.69 0.75 0.86 0.91 0.92 0.96
Permutation 0.19 0.38 0.53 0.60 0.62 0.76 0.80 0.88 0.94 0.97
SSF 0.02 0.09 0.11 0.16 0.26 0.28 0.36 0.54 0.58 0.72
PTT 0.05 0.06 0.05 0.10 0.11 0.10 0.14 0.21 0.11 0.17
δ1 = 0.75 Proposed 0.37 0.67 0.90 0.93 0.97 0.98 1.00 1.00 1.00 1.00
Permutation 0.38 0.66 0.81 0.90 0.96 0.99 0.99 1.00 1.00 1.00
SSF 0.04 0.21 0.37 0.50 0.81 0.86 0.91 0.96 0.97 0.98
PTT 0.09 0.14 0.15 0.33 0.38 0.36 0.47 0.55 0.44 0.54
δ1 = 1.00 Proposed 0.61 0.92 0.97 1.00 1.00 1.00 1.00 1.00 1.00 1.00
Permutation 0.57 0.89 0.95 0.99 1.00 1.00 1.00 1.00 1.00 1.00
SSF 0.14 0.48 0.79 0.90 0.97 0.99 1.00 1.00 1.00 1.00
PTT 0.08 0.23 0.42 0.43 0.54 0.62 0.77 0.77 0.79 0.85

Table 2:

Table lists the empirical power of our proposed test, the permutation test, SSF, and PTT for Setting 2 with δ2 = 0.20, 0.30, 0.40, δ1 = δ3 = 0.00 and sample sizes ranging from 100 to 1000.

Sample Size
100 200 300 400 500 600 700 800 900 1000
δ2 = 0.20 Proposed 0.28 0.46 0.66 0.79 0.86 0.95 0.95 0.97 0.98 0.99
Permutation 0.27 0.43 0.59 0.74 0.86 0.94 0.94 0.98 1.00 1.00
SSF 0.02 0.05 0.21 0.32 0.48 0.62 0.79 0.84 0.88 0.95
PTT 0.04 0.03 0.04 0.08 0.11 0.14 0.12 0.09 0.16 0.26
δ2 = 0.30 Proposed 0.40 0.63 0.81 0.94 0.96 0.99 0.99 1.00 1.00 1.00
Permutation 0.36 0.64 0.79 0.89 0.97 0.98 0.99 1.00 1.00 1.00
SSF 0.03 0.13 0.35 0.52 0.72 0.85 0.91 0.97 0.99 1.00
PTT 0.03 0.08 0.09 0.15 0.31 0.23 0.28 0.40 0.35 0.40
δ2 = 0.40 Proposed 0.73 0.98 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
Permutation 0.78 0.98 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
SSF 0.24 0.74 0.98 0.99 1.00 1.00 1.00 1.00 1.00 1.00
PTT 0.11 0.16 0.18 0.38 0.39 0.52 0.56 0.59 0.81 0.89

For Settings 3 and 4, we excluded the permutation test due to its extremely high computational cost and report results only for our proposed test, SSF, and PTT. As shown in Table 6, it takes more than 150 hours to complete the permutation test for one setting. For Setting 3, we simulated signals with differences in both scale and frequency between the case and control groups. The empirical powers under the different distinguishability parameters are listed in Table 3. The empirical powers of our proposed test and SSF increase in all three cases $(\delta_1, \delta_2) = (0.50, 0.20), (0.75, 0.30), (1.00, 0.40)$. The empirical power of PTT also increases, but much more slowly. When the sample size is small and the signal strength is weak, our proposed test shows a significant gain in power compared with the SSF and PTT tests. For Setting 4, there is a nonlinear magnitude difference along $x^{\langle1\rangle}$ between the two groups. As shown in Table 4, the empirical power of the SSF test converges to one more slowly than that of the proposed test and is lower than 0.65 in the least distinguishable case.

Table 6:

Table lists the computational time (in hours) of running the simulation with 500 replications for our proposed test and the permutation test.

Sample Size
100 200 300 400 500 600 700 800 900 1000
Proposed 0.01 0.03 0.04 0.06 0.07 0.09 0.10 0.12 0.14 0.16
Permutation 3.22 6.14 9.29 13.29 17.93 22.26 26.74 31.26 36.57 42.23

Table 3:

Table lists the empirical power of our proposed test, SSF, and PTT for Setting 3 with (δ1, δ2) = (0.50, 0.20), (0.75, 0.30), (1.00, 0.40), δ3 = 0.00 and sample sizes ranging from 100 to 1000.

Sample Size
100 200 300 400 500 600 700 800 900 1000
δ1 = 0.50 Proposed 0.35 0.51 0.74 0.86 0.91 0.95 0.97 0.98 1.00 1.00
δ2 = 0.20 SSF 0.04 0.15 0.29 0.41 0.57 0.72 0.85 0.89 0.91 0.96
PTT 0.03 0.07 0.07 0.08 0.08 0.06 0.15 0.19 0.21 0.20
δ1 = 0.75 Proposed 0.42 0.70 0.86 0.96 0.99 1.00 1.00 1.00 1.00 1.00
δ2 = 0.30 SSF 0.05 0.26 0.46 0.64 0.79 0.93 0.94 0.95 1.00 1.00
PTT 0.04 0.07 0.11 0.15 0.19 0.23 0.31 0.29 0.43 0.46
δ1 = 1.00 Proposed 0.72 0.98 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
δ2 = 0.40 SSF 0.25 0.72 0.97 0.99 1.00 1.00 1.00 1.00 1.00 1.00
PTT 0.11 0.19 0.22 0.32 0.52 0.50 0.64 0.61 0.73 0.69

Table 4:

Table lists the empirical power of our proposed test, SSF, and PTT for Setting 4 with δ3 = 0.50, 0.75, 1.00, δ1 = δ2 = 0.00 and sample sizes ranging from 100 to 1000.

Sample Size
100 200 300 400 500 600 700 800 900 1000
δ3 = 0.50 Proposed 0.15 0.33 0.47 0.58 0.66 0.75 0.83 0.88 0.89 0.94
SSF 0.01 0.04 0.07 0.16 0.18 0.28 0.35 0.47 0.57 0.64
PTT 0.06 0.03 0.08 0.09 0.09 0.14 0.07 0.13 0.08 0.13
δ3 = 0.75 Proposed 0.35 0.61 0.73 0.84 0.92 0.95 0.99 1.00 1.00 1.00
SSF 0.03 0.12 0.18 0.34 0.56 0.70 0.83 0.86 0.96 0.96
PTT 0.01 0.07 0.06 0.07 0.09 0.12 0.13 0.18 0.18 0.24
δ3 = 1.00 Proposed 0.42 0.70 0.85 0.95 0.99 0.99 1.00 1.00 1.00 1.00
SSF 0.07 0.20 0.52 0.76 0.82 0.92 0.98 0.98 1.00 1.00
PTT 0.09 0.04 0.08 0.10 0.18 0.18 0.21 0.24 0.25 0.28

5.2. Empirical Size Analysis

To examine the accuracy of the significance levels, we generated data from a new setting, Setting 5. We kept the functional form of the control group the same as in Equation (29) and only added a parallel shift to the control function as the function of the case group, i.e., the model does not include nonparallel patterns. In particular,

$f(x^{\langle1\rangle}, x^{\langle2\rangle}) = 2.5\sin(3\pi x^{\langle1\rangle})(1 - x^{\langle1\rangle}) + \delta_4\mathbf{1}\{x^{\langle2\rangle} = 1\},$

where $\delta_4$ was set to 0, 0.5 and 1 to characterize different levels of parallel difference between the two groups. We generated data from Equation (1) with the function $f$ specified in Setting 5. The remaining parameters were set as before.

Table 5 lists the empirical sizes of our proposed test, the permutation test, the SSF test, and PTT under Setting 5. We varied $\delta_4$ from 0.00 to 1.00 to model different magnitudes of the main effect. The empirical size of our proposed test approaches 0.05 as the sample size increases for all values of $\delta_4$. The empirical size of the SSF test fluctuates between 0.03 and 0.11. The inaccurate size of the SSF test may be attributed to the fact that its degrees of freedom are only roughly approximated by rounding the trace of the smoothing matrix. The empirical size of the PTT test fluctuates between 0.02 and 0.12.

Table 5:

Table lists the empirical sizes of the proposed test, permutation test, SSF, and PTT for δ4 = 0.00, 0.50, 1.00 and sample size ranging from 100 to 1000.

Sample Size
100 200 300 400 500 600 700 800 900 1000
δ4 = 0.00 Proposed 0.04 0.07 0.06 0.06 0.05 0.06 0.06 0.07 0.06 0.05
Permutation 0.04 0.08 0.05 0.08 0.06 0.05 0.06 0.04 0.07 0.06
SSF 0.06 0.11 0.03 0.08 0.08 0.03 0.07 0.09 0.07 0.03
PTT 0.03 0.05 0.02 0.02 0.03 0.02 0.12 0.09 0.08 0.06
δ4 = 0.50 Proposed 0.06 0.05 0.05 0.06 0.06 0.05 0.06 0.04 0.05 0.06
Permutation 0.07 0.04 0.05 0.06 0.08 0.09 0.04 0.03 0.05 0.04
SSF 0.06 0.05 0.07 0.06 0.07 0.08 0.07 0.04 0.04 0.07
PTT 0.02 0.02 0.03 0.03 0.07 0.04 0.06 0.07 0.06 0.04
δ4 = 1.00 Proposed 0.07 0.06 0.07 0.06 0.05 0.05 0.06 0.06 0.06 0.05
Permutation 0.04 0.06 0.03 0.05 0.05 0.04 0.03 0.02 0.04 0.04
SSF 0.07 0.07 0.08 0.06 0.04 0.07 0.06 0.09 0.07 0.04
PTT 0.03 0.04 0.03 0.05 0.05 0.04 0.06 0.05 0.05 0.08

5.3. Computation Time

As shown in Tables 1 and 2, our proposed test achieves power similar to that of the permutation test. Next, we compared the computation time of our proposed test and the permutation test on 500 replicated samples. We conducted the comparison on a workstation with an Intel Core i7-8700K CPU and 32 GB RAM. In Table 6, we report the computational time for Setting 1 with $\delta_1 = 0.5$ and sample sizes ranging from 100 to 1000. As shown in Table 6, our proposed test is consistently faster than the permutation test, and is nearly 263 times faster when the sample size is 1000. Note that the computational time of the permutation test exceeds 42 hours when the sample size is 1000 for running 500 tests. In practice, this huge computational cost limits the application of the permutation test in many large-scale studies involving large sample sizes and multiple tests.

5.4. Simulation Studies with Correlated Noise

We designed Setting 6 to evaluate the performance of the proposed test when the noise is correlated. In this example, we generated 100 to 1000 observations, in increments of 100, in each simulation for both the case and control groups in Equation (1). We considered $x_i^{\langle1\rangle}$, $i = 1, \ldots, n$, evenly spaced in $[0, 1]$. We generated two correlated noise vectors $(\epsilon_{11}, \ldots, \epsilon_{n1})$ and $(\epsilon_{12}, \ldots, \epsilon_{n2})$ i.i.d. from $N(0, \Sigma)$, where $\Sigma$ is autoregressive, i.e., its elements are $\Sigma_{ii'} = \rho^{|i - i'|}$ with $\rho = 0.5$. We generated the signal $Y_{ij} = f(x_i^{\langle1\rangle}, x_j^{\langle2\rangle}) + \epsilon_{ij}$, where $f$ is defined in Equation (29) with $\delta_1 = 0.00, 0.50, 0.75, 1.00$ and $\delta_2 = \delta_3 = 0.00$, that is,

$f(x^{\langle1\rangle}, x^{\langle2\rangle}) = \begin{cases}2.5\sin(3\pi x^{\langle1\rangle})(1 - x^{\langle1\rangle}) & \text{if } x^{\langle2\rangle} = 0,\\ (2.5 + \delta_1)\sin(3\pi x^{\langle1\rangle})(1 - x^{\langle1\rangle}) & \text{if } x^{\langle2\rangle} = 1.\end{cases}$

We set the significance level at 0.05 and repeated each simulation 500 times to evaluate the empirical size and power.
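The correlated noise of Setting 6 can be generated from the AR(1) covariance via a Cholesky factor; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
n, rho = 500, 0.5
idx = np.arange(n)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # Sigma_{ii'} = rho^|i-i'|
L = np.linalg.cholesky(Sigma)
eps_control = L @ rng.normal(size=n)                 # (eps_11, ..., eps_n1) ~ N(0, Sigma)
eps_case = L @ rng.normal(size=n)                    # independent copy for the case group
```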

As shown in Table 7, when $\delta_1 = 0.00$, the size of our proposed method concentrates around 0.05-0.07, while the sizes of SSF and PTT fluctuate between 0.02 and 0.16. When $\delta_1 > 0.00$, compared with SSF and PTT, our proposed method attains the highest power, which approaches 1 as $\delta_1$ increases.

Table 7:

Table lists the empirical size (δ1 = 0) and power (δ1 = 0.50, 0.75, 1.00) of our proposed test, SSF and PTT for Setting 6 with δ2 = δ3 = 0.00 and sample size ranging from 100 to 1000.

Sample Size
100 200 300 400 500 600 700 800 900 1000
δ1 = 0.00 Proposed 0.08 0.04 0.06 0.06 0.06 0.08 0.10 0.07 0.06 0.07
SSF 0.08 0.06 0.06 0.09 0.10 0.05 0.09 0.14 0.08 0.06
PTT 0.02 0.10 0.04 0.08 0.11 0.05 0.16 0.12 0.13 0.06
δ1 = 0.50 Proposed 0.21 0.33 0.48 0.57 0.73 0.73 0.82 0.91 0.94 0.96
SSF 0.01 0.05 0.10 0.17 0.29 0.32 0.46 0.48 0.63 0.72
PTT 0.13 0.22 0.35 0.50 0.51 0.53 0.72 0.73 0.78 0.86
δ1 = 0.75 Proposed 0.66 0.89 0.98 1.00 1.00 1.00 1.00 1.00 1.00 1.00
SSF 0.13 0.48 0.74 0.89 0.98 1.00 0.99 1.00 1.00 1.00
PTT 0.16 0.32 0.41 0.43 0.66 0.67 0.73 0.85 0.85 0.89
δ1 = 1.00 Proposed 0.93 0.99 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
SSF 0.47 0.93 0.99 1.00 1.00 1.00 1.00 1.00 1.00 1.00
PTT 0.16 0.41 0.55 0.62 0.69 0.85 0.86 0.90 0.90 0.95

5.5. Simulation Studies with Non-smooth Cases

We evaluate the robustness of the proposed method when the smoothness assumption fails. We designed Setting 7 to test the performance of the proposed test for cases with non-smooth trends. In this setting, we generated 100 to 1000 observations, in increments of 100, in each simulation for both the case and control groups in model (1). We considered $x_i^{\langle1\rangle}$, $i = 1, \ldots, n$, evenly spaced in $[0, 1]$ and $\epsilon_{ij} \overset{iid}{\sim} N(0, 1)$. We generated the signal $Y_{ij} = f(x_i^{\langle1\rangle}, x_j^{\langle2\rangle}) + \epsilon_{ij}$ with $f$ defined as

$f(x^{\langle1\rangle}, x^{\langle2\rangle}) = 2.5\sin(2\pi x^{\langle1\rangle})\mathbf{1}\{x^{\langle1\rangle} \in (0, 0.5)\} + (1 + \delta_5\mathbf{1}\{x^{\langle2\rangle} = 1\})\,x^{\langle1\rangle}\,\mathbf{1}\{x^{\langle1\rangle} \in [0.5, 1)\},$

which is shown in Figure 4. This curve is non-differentiable at $x^{\langle1\rangle} = 0.5$, which is a change point from a nonlinear to a linear trend. We set the significance level at 0.05 and repeated each simulation 500 times to evaluate the empirical size and power.

Figure 4: Solid line with δ5 = 0: function of the control group; dashed and dotted lines with δ5 = 1, 2: functions of the case group for Setting 7.

As shown in Table 8, when $\delta_5 = 0.00$, the empirical size of our proposed method concentrates around 0.05 and is slightly inflated compared with SSF and PTT. When $\delta_5 = 1, 2$, compared with SSF and PTT, our proposed method attains the highest power, which approaches 1 as $n$ increases.

Table 8:

Table lists the empirical size (δ5 = 0) and power (δ5 = 1.00, 2.00) of our proposed test, SSF and PTT for Setting 7.

Sample Size
100 200 300 400 500 600 700 800 900 1000
δ5 = 0.00 Proposed 0.08 0.04 0.06 0.06 0.06 0.08 0.10 0.07 0.06 0.07
SSF 0.01 0.01 0.00 0.00 0.00 0.01 0.01 0.01 0.00 0.00
PTT 0.02 0.06 0.04 0.08 0.03 0.05 0.03 0.06 0.03 0.03
δ5 = 1.00 Proposed 0.21 0.33 0.48 0.57 0.73 0.73 0.82 0.91 0.94 0.96
SSF 0.03 0.02 0.06 0.07 0.09 0.15 0.23 0.29 0.34 0.37
PTT 0.01 0.05 0.01 0.02 0.04 0.02 0.04 0.04 0.08 0.06
δ5 = 2.00 Proposed 0.66 0.89 0.98 1.00 1.00 1.00 1.00 1.00 1.00 1.00
SSF 0.07 0.24 0.46 0.63 0.76 0.87 0.91 0.98 0.97 0.99
PTT 0.02 0.04 0.02 0.09 0.03 0.06 0.07 0.10 0.06 0.06

6. Real Data Examples

We apply the proposed technique to analyze two real data sets: DNA methylation in chronic lymphocytic leukemia and neuroimaging of Alzheimer's disease using fMRI.

6.1. DNA Methylation in Chronic Lymphocytic Leukemia

Recently, Filarsky et al. (2016) reported a DNA methylation study of chronic lymphocytic leukemia (CLL) patients. In the study, DNA samples were extracted from CD19+ cells of 12 CLL patients and B cells of 6 normal subjects. DNA methylation was profiled using the whole-genome tiling array technique. The goal is to identify differentially methylated regions (DMRs), i.e., genome regions with significantly different methylation levels between CLL patients and normal subjects.

To achieve this goal, we compiled the DNA methylation intensities within −3.8 to +1.8 kb of the transcription start site (TSS) for each gene. We used the M-value suggested by Irizarry et al. (2008) as the methylation level at each site and took it as our response variable. In particular, the data consist of (Yij, xi1, xj2), where Yij is the methylation level at the ith genome location xi1 of the jth subject in group xj2, which equals 1 if the jth subject is in the case group and 0 if the jth subject is in the control group. We fit the model in Equation (1) with the SSANOVA decomposition in Equation (2) to the data.

We applied the proposed hypothesis test to 10383 regions. Controlling the FDR at 0.01 using the Benjamini–Hochberg procedure (Benjamini and Hochberg, 1995), we selected 613 DMRs. We conducted gene ontology analysis on the 613 genes corresponding to the 613 identified DMRs using GSEA (Subramanian et al., 2005). Among these genes, 79 participate in the lipid metabolic process, which plays an important role in the development of CLL (Pallasch et al., 2008); this biological process contributes to apoptosis resistance in CLL cells. Furthermore, 78 and 61 genes participate in the immune-related biological processes “Immune system process” and “Regulation of immune system process”, respectively. These observations indicate that aberrant DNA methylation potentially impacts the immune system.
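For concreteness, the following is a minimal Python sketch of the Benjamini–Hochberg step-up procedure used for the FDR control above; the function and variable names are ours, and the input pvals would hold the per-region p-values of the proposed test.

import numpy as np

def benjamini_hochberg(pvals, q=0.01):
    # Benjamini-Hochberg step-up procedure controlling FDR at level q.
    # Returns a boolean mask over the input hypotheses.
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Find the largest k with p_(k) <= (k / m) * q.
    below = ranked <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        reject[order[:k + 1]] = True  # reject the k smallest p-values
    return reject

Calling benjamini_hochberg(pvals, q=0.01) on the vector of region-level p-values returns the rejection mask that defines the selected DMRs.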

Our Wald-type test, even after FDR control, yields p-values as small as 10−9. Consequently, it is very difficult to compare our test with a permutation test based on only hundreds or thousands of permutations. Thus, we compared our proposed test with the permutation test (based on 500 permutations) only for regions with p-values larger than 0.05; the average difference between our test and the permutation test is 0.012.
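The comparison can be made concrete with a generic group-label permutation test. The sketch below assumes a user-supplied statistic stat_fn (a hypothetical placeholder, not our Tn,λ); it also shows why 500 permutations cannot resolve p-values below roughly 1/501 ≈ 0.002.

import numpy as np

def permutation_pvalue(stat_fn, Y, groups, n_perm=500, seed=0):
    # Y: (subjects x locations) matrix of M-values for one region;
    # groups: 0/1 case-control labels; stat_fn grows under nonparallelism.
    rng = np.random.default_rng(seed)
    observed = stat_fn(Y, groups)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(groups)  # shuffle case/control labels
        if stat_fn(Y, perm) >= observed:
            exceed += 1
    # Add-one correction: the smallest attainable p-value is 1/(n_perm+1),
    # so p-values near 1e-9 from the Wald-type test cannot be matched.
    return (exceed + 1) / (n_perm + 1)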

We highlight two DMRs with significant nonparallel patterns in Figure 5. Focal hypermethylation at genome locations 42574000 and 42576500 is observed in the promoter region of the gene MTA3. It was reported in Bilban et al. (2006) that the MTA3 signaling pathway is a potential biomarker for CLL and shows significantly altered gene expression. Our test identified a significant difference in the methylation levels of the MTA3 gene between CLL patients and normal subjects, which has potential prognostic value. In the promoter region of DNMT3A, we observed significant hypomethylation at genome location 25244500. DNMT3 is a family of DNA methyltransferases that can methylate hemimethylated and unmethylated CpG sites at the same rate (Okano et al., 1998). Since global hypomethylation is observed, the aberrant methylation level of this DNA methyltransferase may influence the global trend.

Figure 5: The promoter regions of two genes, (a) MTA3 and (b) DNMT3A. The horizontal axis is the genomic location and the vertical axis is the M-value representing the methylation level. The red and blue lines are the fitted curves for the case and control groups, respectively.

6.2. Neuroimaging of Alzheimer’s Disease using fMRI

Alzheimer’s disease (AD) is one of the most common neurological diseases, characterized by neurodegeneration and cognitive decline (Rombouts et al., 2005; Wang et al., 2006). Despite the prevalence of AD, no cure or preventive method is available, owing to the lack of a complete understanding of the mechanisms that contribute to AD pathophysiology. Discovering the aberrant neural networks of AD would fundamentally advance the scientific understanding of this disease.

In this study, we analyzed data collected by the Alzheimer’s Disease Neuroimaging Initiative (ADNI)2, in which the resting-state fMRI signals of 60 normal/early-mild-cognitive-impairment subjects (control group) and 50 AD/late-mild-cognitive-impairment subjects (AD group) were collected from 256×256×170 voxels at 140 consecutive time points with equal time intervals of 30ms. The fMRI signals of each subject were preprocessed using the fMRI Expert Analysis Tool (FEAT) (Smith et al., 2004) for skull-stripping, motion correction, slice timing correction, temporal filtering, spatial smoothing, and registration to standard space (MNI152 T1 2mm model), so that signals from all subjects can be treated as coming from the same standardized brain template. Sixty-nine brain regions of interest (ROIs) defined by the Harvard-Oxford Atlas (http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/Atlases) were extracted by an automatic regional labeling approach from the preprocessed fMRI data. For each ROI, we consider model (1) with the SSANOVA decomposition in Equation (2), where Yij records the average blood-oxygen-level (Huettel et al., 2004) of the brain region for subject j measured at time point xi1. As the blood-oxygen-level accurately quantifies the corresponding brain activity, we can detect abnormal AD-related brain activity. The testing problem in Equation (18) is equivalent to testing whether the brain activities of a given ROI have different temporal patterns in the case and control groups.

Seven cortical regions (parahippocampal gyrus, cingulate gyrus, inferior temporal gyrus, post-central gyrus, juxtapositional lobule cortex, precuneous cortex, and central opercular cortex) and one sub-cortical region (right thalamus) with significantly different temporal patterns were identified using our test, with the false discovery rate controlled at 5% using the Benjamini–Hochberg procedure (Benjamini and Hochberg, 1995). Among the eight ROIs, the parahippocampal gyrus and cingulate gyrus have been shown clinically to be risk factors for AD. As demonstrated in Echávarri et al. (2011) and Kesslak et al. (1991), the parahippocampal gyrus of AD patients shows significant atrophy. Meanwhile, the cingulate gyrus has also been found to be AD related (Scheff et al., 2015), owing to its extensive connectivity with multiple cortical areas, especially areas involved in learning and memory. In Figure 6, we plot frontal, axial, and lateral views and the corresponding temporal patterns of the parahippocampal gyrus and cingulate gyrus. The temporal regions with significant differences between AD/late-mild-cognitive-impairment subjects (red line) and normal/early-mild-cognitive-impairment subjects (blue line) are highlighted. As clearly demonstrated in the lower left panel of Figure 6, the first highlighted area of the parahippocampal gyrus has a significantly reversed pattern between the case and control groups. The second highlighted area shows reduced levels for the AD group. For the cingulate gyrus, the highlighted regions in the right panel of Figure 6 show a clearly larger magnitude for the AD group. This difference was also observed via fMRI in a visual encoding memory task (Rami et al., 2012). Both experiments suggest that the difference may alter memory function.

Figure 6: Blood-oxygen-levels of the parahippocampal gyrus (left) and cingulate gyrus (right) for the control group (blue) and the AD group (red), observed at 140 time points. Physical locations of the two ROIs on the frontal, axial, and lateral views are illustrated at the top of each panel.

7. Discussion

Hypothesis testing in SSANOVA is a very challenging problem. In this paper, we develop a Wald-type test for the significance of nonparallelism in a two-way SSANOVA model. The optimality of the proposed test is justified by the minimax distinguishable rate. Extensive empirical studies suggest that the proposed test has superior performance over existing methods. Although we only discuss testing the significance of nonparallelism in a two-way SSANOVA model, tests for higher-order SSANOVA models can be developed in parallel with our framework.

Supplementary Material


Acknowledgments

PM was funded in part by NSF DMS-1440037, 1438957, 1925066 and NIH 1R01GM122080-01. WZ was funded in part by NSF DMS-1440038, 1903226, 1925066 and NIH 1R01GM113242-01. Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

Appendix A. Proof of Main Results

In this section, we present the main proofs of the theorems and lemmas in the main text.

A.1. Notation Clarification

We rewrite (16) as

\hat{f}_{11} = \mathcal{K}_{11} M^{-1}\bigl(I_{ns} - S(S^{T}M^{-1}S)^{-1}S^{T}M^{-1}\bigr)\,y,

where

S = \begin{bmatrix} I_n \otimes 1_{w_0} & 0 \\ 0 & I_n \otimes 1_{w_1} \end{bmatrix}\begin{bmatrix} 1_n & 1_n \\ 1_n & 0 \end{bmatrix}, \qquad \mathcal{K}_{11} = \frac{1}{2}\begin{bmatrix} K_{11} & -K_{11} \\ -K_{11} & K_{11} \end{bmatrix},

and

M = \begin{bmatrix} I_n \otimes 1_{w_0} & 0 \\ 0 & I_n \otimes 1_{w_1} \end{bmatrix}\left(\frac{\theta_{10}}{2}\begin{bmatrix} K_{11} & K_{11} \\ K_{11} & K_{11} \end{bmatrix} + \frac{\theta_{11}}{2}\begin{bmatrix} K_{11} & -K_{11} \\ -K_{11} & K_{11} \end{bmatrix} + \lambda I_{2n}\right)\begin{bmatrix} I_n \otimes 1_{w_0}^{T} & 0 \\ 0 & I_n \otimes 1_{w_1}^{T} \end{bmatrix},

K11 is the kernel matrix of H11 with (i, i′)th entry (1/n)K11(xi1, xi′1), w0 is the number of subjects in the control group, w1 is the number of subjects in the case group, and ⊗ denotes the Kronecker product. Based on Chapter A.3 in Gu (2013), we set θ10 ∝ 1/Tr(K10) and θ11 ∝ 1/Tr(K11) with θ10 + θ11 = 1.

In the following theoretical derivation, we focus only on the case s = 2, i.e., w0 = w1 = 1. If there are s > 2 subjects, the proof can be generalized easily by replacing Equation (15) with penalized weighted least squares; see Section 3.2.4 in Gu (2013).
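As a numerical companion to the displays above, the following Python sketch assembles S, the block kernel matrices, and M for w0 = w1 = 1, and evaluates the estimate in (16). The block signs follow our reconstruction of the extraction-garbled matrices and should be verified against the published display before use.

import numpy as np

def build_test_matrices(K, theta10, theta11, lam):
    # K: n x n marginal kernel matrix with entries K11(x_i1, x_i'1) / n.
    n = K.shape[0]
    ones = np.ones((n, 1))
    # Null-space design: an intercept column and a group-indicator column.
    S = np.block([[ones, ones],
                  [ones, np.zeros((n, 1))]])
    # Interaction kernel: positive within groups, negative across groups
    # (our reconstruction of the sign pattern).
    K11_blk = 0.5 * np.block([[K, -K], [-K, K]])
    # Main-effect kernel: positive blocks throughout.
    K10_blk = 0.5 * np.block([[K, K], [K, K]])
    M = theta10 * K10_blk + theta11 * K11_blk + lam * np.eye(2 * n)
    return S, K11_blk, M

def f11_hat(K, theta10, theta11, lam, y):
    # Penalized estimate of the interaction component, following our
    # reconstruction of (16); y stacks the control and case responses.
    S, K11_blk, M = build_test_matrices(K, theta10, theta11, lam)
    Minv = np.linalg.inv(M)
    P = np.eye(2 * K.shape[0]) - S @ np.linalg.solve(S.T @ Minv @ S,
                                                     S.T @ Minv)
    return K11_blk @ Minv @ P @ y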

A.2. Proofs for Section 3

A.2.1. Preliminary

We identify a sequence model that is equivalent to our nonparametric model (1) with the SSANOVA decomposition in Equation (2). Let {μi, ϕi}, i = 1, 2, …, be the eigenvalue and eigenfunction pairs of H1 and {νj, ψj}, j = 1, 2, be the eigenvalue and eigenfunction pairs of H2. In the tensor product space H = H1 ⊗ H2, as shown in Lin (2000), the eigenvalues and eigenfunctions are {μiνj, ϕiψj}, i = 1, …, ∞, j = 1, 2. Model (1) is equivalent to the sequence model

z_{ij} = \theta_{ij} + \omega_{ij}, (30)

where θij = (1/2) Σ_{x〈2〉=0}^{1} ∫_{𝒳1} f(x〈1〉, x〈2〉) ϕi(x〈1〉) ψj(x〈2〉) dω(x〈1〉) are the basis expansion coefficients, and the random noise ωij has mean zero and variance σ²/n. The space E = {f ∈ H : ‖f‖H ≤ 1} in Equation (30) is equivalent to E = {θ : Σ_{i=1}^{∞} Σ_{j=1}^{2} θij²(μiνj)^{−1} ≤ 1}. The hypothesis in Equation (18) is equivalent to the hypothesis

H0 : θi2 = 0 for i = 2, …, n.

Let θ11 = (θ22, θ32, …, θn2)ᵀ and E11 = {θ11 : Σ_{i=2}^{n} θi2²(μiν2)^{−1} ≤ 1}. Consider a local alternative H1n : θ11 ∈ E11 with ‖θ11‖2 ≥ dn, where dn represents a generic distinguishable rate. The total error of a generic testing rule ϕn under distinguishable rate dn can be written as

\mathrm{pseudo.risk}(\phi_n, d_n) = \mathbb{E}_{H_0}\{\phi_n \mid H_0 \text{ is true}\} + \sup_{\theta_{11}\in E_{11},\, \|\theta_{11}\|_2 \ge d_n} \mathbb{E}\{1-\phi_n \mid H_1 \text{ is true}\}. (31)

Equation (31) is consistent with the testing error defined by Ingster (1993) and Wei and Wainwright (2020). For simplicity of exposition, we order the axis lengths {μiν2}, i = 2, 3, …, from the largest to the smallest as {ρp}, p = 1, 2, …. Next we introduce a lemma that gives a lower bound on the minimum pseudo risk.

Lemma 11 For every set C and probability measure Q supported on C ∩ ℬᶜ(dn), we have

\inf_{\phi_n}\ \mathrm{pseudo.risk}(\phi_n, d_n) \ge 1 - \frac{1}{2}\sqrt{\mathbb{E}_{\eta,\eta'}\exp\left(\frac{n\langle\eta,\eta'\rangle}{\sigma^2}\right) - 1},

where E_{η,η′} denotes expectation with respect to an i.i.d. pair η, η′ ∼ Q.

The proof of this lemma follows directly from Lemma 3 in Wei and Wainwright (2020).

A.2.2. Proof of Lemma 4

Proof As shown in Lemma 11, we have

\inf_{\phi_n}\ \mathrm{pseudo.risk}(\phi_n, d_n) \ge 1 - \frac{1}{2}\sqrt{\mathbb{E}_{\eta,\eta'}\exp\left(\frac{n\langle\eta,\eta'\rangle}{\sigma^2}\right) - 1}. (32)

Next we show that if δ² ≤ √(kB(δ)) σ²/(4n), the right-hand side of Equation (32) is larger than 1/2. Let θb = (δ/√k) Σ_{i=1}^{k} bi ei, where ei is the standard basis vector whose ith coordinate is one. We take Q to be the uniform distribution on {θb : b ∈ {−1, 1}ᵏ}. The expectation in the last term of Equation (32) can be written as

\mathbb{E}_{\eta,\eta'}\exp\left(\frac{n\langle\eta,\eta'\rangle}{\sigma^2}\right)
= \frac{1}{2^{2k}}\sum_{b,b'}\exp\left(\frac{n\,\theta_b^{T}\theta_{b'}}{\sigma^2}\right)
= \frac{1}{2^{2k}}\sum_{b,b'}\exp\left(\frac{n\delta^{2}\sum_{i=1}^{k}b_i b_i'}{k\sigma^{2}}\right)
= \frac{1}{2^{k}}\left(\exp\left(\frac{n\delta^{2}}{k\sigma^{2}}\right)+\exp\left(-\frac{n\delta^{2}}{k\sigma^{2}}\right)\right)^{k}
\overset{(i)}{\le} \left(1+\frac{n^{2}\delta^{4}}{k^{2}\sigma^{4}}\right)^{k}
\overset{(ii)}{\le} \exp\left(\frac{n^{2}\delta^{4}}{k\sigma^{4}}\right),

where (i) is due to the fact that (1/2)(eˣ + e⁻ˣ) ≤ 1 + x² for |x| ≤ 1/2 and (ii) is due to the fact that 1 + x ≤ eˣ. Thus for any δ⁴ ≤ kσ⁴/(16n²), we have

\inf_{\phi_n}\ \mathrm{pseudo.risk}(\phi_n, d_n) \ge 1 - \frac{1}{2}\sqrt{e^{1/16} - 1} \ge 1/2.

By the definition of rB, we have pseudo.risk(ϕn, dn) > 1/2 for all dn ≤ rB. ■
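The inequality chain in the proof can be checked numerically by brute-force enumeration of the sign vectors for a small k; the following Python sketch is a verification aid, not part of the proof.

import itertools
import numpy as np

def chi2_term(n, delta, k, sigma2=1.0):
    # Exact value of E exp(n <theta_b, theta_b'> / sigma^2) under the
    # uniform prior on sign vectors b, b' in {-1, +1}^k.
    total, count = 0.0, 0
    for b in itertools.product((-1.0, 1.0), repeat=k):
        for bp in itertools.product((-1.0, 1.0), repeat=k):
            inner = (delta ** 2 / k) * np.dot(b, bp)
            total += np.exp(n * inner / sigma2)
            count += 1
    return total / count

n, k = 100, 8
delta = (k / (16.0 * n ** 2)) ** 0.25  # boundary case delta^4 = k/(16 n^2)
lhs = chi2_term(n, delta, k)
rhs = np.exp(n ** 2 * delta ** 4 / k)  # equals exp(1/16) at the boundary
print(lhs <= rhs)                      # expect True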

A.2.3. Proof of Lemma 5

Proof We show that bk,2(E11) is bounded below by √ρk+1. It suffices to show that E11 contains an ℓ2 ball of radius √ρk+1 centered at f11 = 0 within a (k + 1)-dimensional subspace. For any v in the (k + 1)-dimensional subspace spanned by the eigenvectors corresponding to the (k + 1) largest eigenvalues, with ‖v‖2 ≤ √ρk+1, we have

\sum_{i=1}^{\infty}\frac{v_i^2}{\rho_i} \overset{(i)}{=} \sum_{i=1}^{k+1}\frac{v_i^2}{\rho_i} \overset{(ii)}{\le} \frac{1}{\rho_{k+1}}\sum_{i=1}^{k+1}v_i^2 \le 1,

where equality (i) holds because v lies in the (k + 1)-dimensional subspace spanned by the eigenvectors corresponding to the first (k + 1) largest eigenvalues, and inequality (ii) holds by the decreasing order of the eigenvalues, i.e., ρ1 ≥ ρ2 ≥ … ≥ ρk+1. Hence v ∈ E11.

Recalling the definition of the Bernstein lower critical dimension, kB(δ) = arg maxk{b²k−1,2(E11) ≥ δ²}, we have

kB(δ) ≥ arg maxk{ρk ≥ δ²}.

A.2.4. Proof of Theorem 6

Proof By Lemma 4, we have

dn ≳ sup{δ : kB(δ) ≥ 16n²δ⁴}.

We plug the lower bound on kB(δ) from Lemma 5 into this expression. Then we have

dn ≳ sup{δ : arg maxk{ρk ≥ δ²} ≥ 16n²δ⁴}. (33)

The eigenvalues have a polynomial decay rate, i.e., ρp ≍ p^{−2m}; consequently, arg maxk{ρk ≥ δ²} ≍ δ^{−1/m}. Plugging this into Equation (33), it is easy to see that the supremum on the right-hand side is of order n^{−2m/(4m+1)}. The proof is thus completed. ■
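The rate calculation at the end of the proof admits a quick numerical sanity check: solving δ^{−1/m} = 16n²δ⁴ in closed form and dividing by n^{−2m/(4m+1)} yields a ratio that is constant in n. A small Python sketch:

import numpy as np

def critical_delta(n, m):
    # Balance point of (33): delta^(-1/m) = 16 n^2 delta^4
    # implies delta = (16 n^2)^(-m / (4m + 1)).
    return (16.0 * n ** 2) ** (-m / (4.0 * m + 1.0))

m = 2.0
for n in (10 ** 3, 10 ** 4, 10 ** 5):
    ratio = critical_delta(n, m) / n ** (-2.0 * m / (4.0 * m + 1.0))
    print(n, ratio)  # the ratio is constant in n, confirming the order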

A.3. Proof of Theorem 7

Before proving Theorem 7, we first state Lemma 12, Lemma 13, Lemma 14, and Lemma 15, which are used in the proof of Theorem 7. The proofs of these auxiliary lemmas are deferred to the Supplementary Material.

A.3.1. Some Auxiliary Lemmas

Lemma 12 shows that the projection of f10 on H11 ⊂ Hmodel is zero. This result indicates that our test statistic does not depend on the nuisance parameter f10.

Lemma 12 The quantity 𝒦11M⁻¹(Ins − S(SᵀM⁻¹S)⁻¹SᵀM⁻¹)f10 equals zero.

The next two lemmas show the equivalence of τλ and τ̂λ under the quasi-uniform design and the uniform design.

Lemma 13 If x〈1〉 follows the quasi-uniform random design, then for any λ = 1/n^{1−c}, m > 3/2, and any δ, c > 0, we have

P(\hat{\tau}_\lambda \asymp \tau_\lambda) \ge 1 - \left(n^{\frac{2}{2m-1}-2\delta} + n^{\frac{1}{2m-1}}\right)\exp\left\{-c\,n^{\frac{2m-3}{2m-1}+2\delta}\right\},

where τλ = max{i | μi ≥ λ} and τ̂λ = max{i | μ̂i ≥ λ}.

Lemma 14 If x〈1〉 follows the uniform fixed design condition, then for m > 1/2 and λ > 0, we have

τ̂λ ≍ τλ.

In the following lemma, we bound Tr(Δ) by a function of τ̂λ. This result is essential in deriving the asymptotic distribution of Tn,λ.

Lemma 15 For Δ = M⁻¹𝒦11²M⁻¹ defined in Theorem 7, we have

\frac{4\hat{\tau}_\lambda}{9} \le \mathrm{Tr}(\Delta) \le \frac{4}{(1-\theta_d)^2}\left(\hat{\tau}_\lambda + \frac{1}{2\lambda}\sum_{i=\hat{\tau}_\lambda+1}^{n}\hat{\mu}_i\right). (34)

A.3.2. Proof of Theorem 7

Proof For simplicity, we suppose σ² = 1. We define the three terms on the right-hand side of Equation (26) as T1, T2, and T3, i.e.,

T_1 = \frac{1}{n}\epsilon^{T}\Delta\epsilon, \qquad T_2 = \frac{1}{n}\epsilon^{T}M^{-1}S(S^{T}M^{-1}S)^{-1}S^{T}\Delta S(S^{T}M^{-1}S)^{-1}S^{T}M^{-1}\epsilon, \qquad T_3 = \frac{1}{n}\epsilon^{T}M^{-1}S(S^{T}M^{-1}S)^{-1}S^{T}\Delta\epsilon.

We now show that T2 and T3 are of smaller order than T1. First, we analyze the second term T2 in Equation (26). We have

\mathbb{E}[T_2] = \frac{1}{n}\mathbb{E}\bigl[\epsilon^{T}M^{-1}S(S^{T}M^{-1}S)^{-1}S^{T}\Delta S(S^{T}M^{-1}S)^{-1}S^{T}M^{-1}\epsilon\bigr]
= \frac{1}{n}\mathrm{Tr}\bigl(M^{-1}S(S^{T}M^{-1}S)^{-1}S^{T}\Delta S(S^{T}M^{-1}S)^{-1}S^{T}M^{-1}\bigr)
\le \frac{2}{n}\lambda_{\max}(\Delta)\,\lambda_{\max}\bigl(M^{-1}S(S^{T}M^{-1}S)^{-1}S^{T}S(S^{T}M^{-1}S)^{-1}S^{T}M^{-1}\bigr)
\le \frac{2}{n}\lambda_{\max}(\Delta),

where λmax(·) denotes the largest eigenvalue. Since all eigenvalues of Δ are less than 1, we have E[T2] ≤ 2/n. Analogously, we can derive a variance bound for T2. Combining these results and using the Chebyshev inequality, we have

T_2 = O_p\left(\frac{1}{n}\right). (35)

Second, we analyze the third term T3 in Equation (26). We apply the Cauchy-Schwarz inequality and have

|T_3| \le \sqrt{T_2}\,\sqrt{T_1}. (36)

Finally, we derive the magnitude of T1. We first consider the testing consistency of T1 conditional on X. Denote Eϵ as the expectation with respect to ϵ, and define Varϵ as the variance with respect to ϵ. Note that

E_ϵ[ϵᵀΔϵ] = Tr(Δ),  Var_ϵ[ϵᵀΔϵ] = 2Tr(Δ²).

Let Z = (ϵᵀΔϵ − Tr(Δ))/√(2Tr(Δ²)) and t ∈ (−1/2, 1/2). Then the log-characteristic function of Z can be written as

\log \mathbb{E}_\epsilon[\exp(itZ)] = \log \mathbb{E}_\epsilon\left[\exp\left(\frac{it\,\epsilon^{T}\Delta\epsilon}{\sqrt{2\,\mathrm{Tr}(\Delta^2)}}\right)\right] - \frac{it\,\mathrm{Tr}(\Delta)}{\sqrt{2\,\mathrm{Tr}(\Delta^2)}} = -\frac{1}{2}\log\det\left\{I_{2n} - \frac{2it\,\Delta}{\sqrt{2\,\mathrm{Tr}(\Delta^2)}}\right\} - \frac{it\,\mathrm{Tr}(\Delta)}{\sqrt{2\,\mathrm{Tr}(\Delta^2)}}. (37)

Through a Taylor expansion, one has

-\frac{1}{2}\log\det\left\{I_{2n} - \frac{2it\,\Delta}{\sqrt{2\,\mathrm{Tr}(\Delta^2)}}\right\} = \frac{it\,\mathrm{Tr}(\Delta)}{\sqrt{2\,\mathrm{Tr}(\Delta^2)}} - \frac{t^2\,\mathrm{Tr}(\Delta^2)}{2\,\mathrm{Tr}(\Delta^2)} + O\left(\frac{t^3\,\mathrm{Tr}(\Delta^3)}{[\mathrm{Tr}(\Delta^2)]^{3/2}}\right). (38)

Combining Equations (37) and (38), we have

\log \mathbb{E}_\epsilon[\exp(itZ)] = -\frac{t^2}{2} + O\left(\frac{t^3\,\mathrm{Tr}(\Delta^3)}{[\mathrm{Tr}(\Delta^2)]^{3/2}}\right). (39)

Since all eigenvalues of Δ are less than 1, we have Tr(Δ³) ≤ Tr(Δ²). Analogous to (S.11), we have

\mathrm{Tr}(\Delta^2) \ge \frac{16}{81}\,\hat{\tau}_\lambda. (40)

Under the quasi-uniform design, we have Tr(Δ2) → ∞ as λ → 0 with probability approaching 1 by Lemma 13 and Equation (40). Hence, the second term on the right-hand side of Equation (39) is op(1). We thus conclude that

\mathbb{E}_\epsilon[\exp(itZ)] \overset{P}{\longrightarrow} \exp\left(-\frac{t^2}{2}\right).

Next, we show that

\mathbb{E}[\exp(itZ)] = \mathbb{E}_X\bigl[\mathbb{E}_\epsilon[\exp(itZ)]\bigr] \longrightarrow \exp(-t^2/2)

for t ∈ (−1/2, 1/2). If not, there exist an ε > 0 and a subsequence of random variables X_{nk}〈1〉 such that |E_{X_{nk}〈1〉}E_ϵ exp(itZ) − exp(−t²/2)| > ε. On the other hand, since E_ϵ exp(itZ(X_{nk}〈1〉)) →P exp(−t²/2) and is bounded, there exists a sub-subsequence {X_{nkl}〈1〉} such that E_ϵ exp(itZ(X_{nkl}〈1〉)) →a.s. exp(−t²/2). Then, by the dominated convergence theorem, E_{X_{nkl}〈1〉}E_ϵ exp(itZ) → exp(−t²/2), which is a contradiction. Under the uniform design, we can obtain E[exp(itZ)] → exp(−t²/2) directly from Lemma 14 and Equation (40).

Thus Z is asymptotically normally distributed, and

\frac{T_1 - \mathrm{Tr}(\Delta)/n}{\sqrt{2\,\mathrm{Tr}(\Delta^2)/n^2}} \overset{d}{\longrightarrow} N(0, 1). (41)

Combining (35), (36) and (41), the theorem follows. ■
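The Wilks-type limit in (41) can be visualized by simulation. The Python sketch below draws the normalized quadratic form for an illustrative diagonal Δ with eigenvalues in (0, 1) (our choice for illustration, not the Δ of Theorem 7) and checks that its first two moments are close to those of N(0, 1).

import numpy as np

def normalized_quadratic_form(Delta, n_draws=2000, seed=0):
    # Draws of Z = (e' Delta e - tr Delta) / sqrt(2 tr Delta^2)
    # for standard Gaussian e, the quantity whose limit is used in (41).
    rng = np.random.default_rng(seed)
    d = Delta.shape[0]
    tr1 = np.trace(Delta)
    tr2 = np.trace(Delta @ Delta)
    draws = np.empty(n_draws)
    for i in range(n_draws):
        e = rng.standard_normal(d)
        draws[i] = (e @ Delta @ e - tr1) / np.sqrt(2.0 * tr2)
    return draws

# Illustrative Delta: polynomially decaying eigenvalues in (0, 1), so that
# tr(Delta^2) grows with the dimension, as the proof requires.
d = 400
eigs = 1.0 / (1.0 + (np.arange(1, d + 1) / 50.0) ** 4)
Z = normalized_quadratic_form(np.diag(eigs))
print(Z.mean(), Z.std())  # approximately 0 and 1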

A.4. Proof of Theorem 8

Proof Under the alternative hypothesis, the statistic Tn,λ in Equation (26) can be decomposed into three terms as follows

T_{n,\lambda} = \frac{1}{n}\|H\epsilon\|_2^2 + \frac{1}{n}\|Hf_{11}\|_2^2 + \frac{2}{n}f_{11}^{T}H^{T}H\epsilon, (42)

where H = θ11𝒦11M⁻¹(I − S(SᵀM⁻¹S)⁻¹SᵀM⁻¹). Let W1 = (1/n)‖Hϵ‖2², W2 = (1/n)‖Hf11‖2², and W3 = (2/n)f11ᵀHᵀHϵ denote the corresponding three terms on the right-hand side of Equation (42).

We now derive a lower bound for W2. By Lemma S.1, we have

\frac{1}{n}\|Hf_{11} - f_{11}\|_2^2 \le \frac{1}{n}\|Hf_{11} - f_{11}\|_2^2 + \frac{1}{n}\|Hf_{10} - f_{10}\|_2^2 = \frac{1}{n}\|Hf_{10} + Hf_{11} - f_{10} - f_{11}\|_2^2 = \|\tilde{g}^* - g^*\|_n^2 \le c\lambda. (43)

We consider the distinguishable rate

\frac{1}{n}\|f_{11}\|_2^2 = \|f_{11}\|_n^2 > c^2 d_n^2 = c(\lambda + \sigma_{n,\lambda}), (44)

where the inequality is satisfied since ∥ · ∥n dominates ∥ · ∥2 by Lemma S.2. The lower bound of W2 is thus,

W_2 = \frac{1}{n}\|Hf_{11}\|_2^2 \ge \frac{1}{n}\|f_{11}\|_2^2 - \frac{1}{n}\|f_{11} - Hf_{11}\|_2^2 \ge c^2 d_n^2 - c\lambda = c\,\sigma_{n,\lambda}, (45)

where the last two inequalities are obtained by combining (43) with Equation (44).

For the third term W3, it is seen that EW3 = 0. It is easy to verify that the eigenvalues of HHᵀ are all less than 1. Moreover,

\mathbb{E}W_3^2 = \frac{4}{n^2}\mathbb{E}\bigl[f_{11}^{T}H^{T}H\epsilon\epsilon^{T}H^{T}Hf_{11}\bigr] = \frac{4}{n^2}(Hf_{11})^{T}HH^{T}(Hf_{11}) \le \frac{4}{n^2}(Hf_{11})^{T}(Hf_{11}) = \frac{4}{n}W_2.

By Chebyshev’s inequality, for any ϵ > 0, we have

\mathbb{P}\left(|W_3| \ge \frac{2}{\sqrt{\epsilon}}\sqrt{\frac{W_2}{n}}\right) \le \frac{n\,\mathbb{E}W_3^2}{4\epsilon^{-1}W_2} \le \epsilon.

Consequently, there exists an n0 such that for any n > n0, we have

\mathbb{P}\left\{|W_3| > \frac{1}{2}W_2\right\} \le \mathbb{P}\left(|W_3| \ge \frac{2}{\sqrt{\epsilon}}\sqrt{\frac{W_2}{n}}\right) \le \epsilon. (46)

Now, we are ready to prove our theorem. By the triangle inequality, we have

\left|\frac{W_1 - \mu_{n,\lambda}}{\sigma_{n,\lambda}} + \frac{W_2 + W_3}{\sigma_{n,\lambda}}\right| \ge \left|\frac{W_2 + W_3}{\sigma_{n,\lambda}}\right| - \left|\frac{W_1 - \mu_{n,\lambda}}{\sigma_{n,\lambda}}\right| (47)
\ge \left|\frac{W_2}{\sigma_{n,\lambda}}\right| - \left|\frac{W_3}{\sigma_{n,\lambda}}\right| - \left|\frac{W_1 - \mu_{n,\lambda}}{\sigma_{n,\lambda}}\right|. (48)

If |W1 − μn,λ|/σn,λ ≤ Cϵ and |W3| ≤ (1/2)W2 hold, then in view of (47), (48), and Equation (45), we have

\left|\frac{W_1 - \mu_{n,\lambda}}{\sigma_{n,\lambda}} + \frac{W_2 + W_3}{\sigma_{n,\lambda}}\right| \ge \frac{1}{2}c - C_\epsilon.

Noting that W1 is identical to Equation (26), by Theorem 7 we have |W1 − μn,λ|/σn,λ = Op(1). That is, for any ϵ > 0, there exist a constant Cϵ > 0 and an integer s such that for any n > s, we have

\mathbb{P}\left(\frac{|W_1 - \mu_{n,\lambda}|}{\sigma_{n,\lambda}} > C_\epsilon\right) \le \epsilon. (49)

Setting c ≥ 2(Cϵ + z1−α/2) and N = max(n0, s), for any n > N, we have

\mathbb{P}(\phi_{n,\lambda} = 1) = \mathbb{P}\left\{\frac{|W_1 + W_2 + W_3 - \mu_{n,\lambda}|}{\sigma_{n,\lambda}} \ge z_{1-\alpha/2}\right\}
\ge \mathbb{P}\left\{\frac{|W_1 - \mu_{n,\lambda}|}{\sigma_{n,\lambda}} \le C_\epsilon,\ |W_3| \le \frac{1}{2}W_2\right\}
\ge 1 - \mathbb{P}\left\{\frac{|W_1 - \mu_{n,\lambda}|}{\sigma_{n,\lambda}} > C_\epsilon\right\} - \mathbb{P}\left\{|W_3| > \frac{1}{2}W_2\right\} \ge 1 - 2\epsilon,

where the second inequality is due to Boole’s inequality (Casella and Berger, 2002) and the last inequality is obtained by combining Equation (46) and Equation (49). Thus, we have

\sup_{H_1^*} \mathbb{E}(1 - \phi_{n,\lambda} \mid H_1^* \text{ is true}) < \delta,

where H1* = {f : f ∈ Hmodel and ‖f11‖2 ≥ Cδ√(λ + σn,λ) ≥ dn}. ■

Footnotes

1.

The mth order Sobolev space is defined as H1 = {η1 ∈ L2[0, 1] : η1^(k) is absolutely continuous for k = 0, 1, …, m − 1}.

2.

Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf

References

1. Abramowitz Milton and Stegun Irene A. Handbook of mathematical functions: with formulas, graphs, and mathematical tables. National Bureau of Standards, Washington, DC, 1964.
2. Alaoui Ahmed and Mahoney Michael W. Fast randomized kernel ridge regression with statistical guarantees. In Advances in Neural Information Processing Systems 28, pages 775–783, 2015.
3. Bartlett Peter L, Bousquet Olivier, and Mendelson Shahar. Local Rademacher complexities. Annals of Statistics, 33(4):1497–1537, 2005.
4. Benjamini Yoav and Hochberg Yosef. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), 57(1):289–300, 1995.
5. Bilban Martin, Heintel Daniel, Scharl Theresa, Woelfel Thomas, Auer Michael M, Porpaczy Edit, Kainz Birgit, Krober Alexander, Carey Vincent J, Shehata Medhat, Zielinski C, Pickl W, Stilgenbauer S, Gaiger A, Wagner O, Jager U, and German CLL Study Group. Deregulated expression of fat and muscle genes in B-cell chronic lymphocytic leukemia with high lipoprotein lipase expression. Leukemia, 20(6):1080–1088, 2006.
6. Braun Mikio L. Accurate error bounds for the eigenvalues of the kernel matrix. Journal of Machine Learning Research, 7(Nov):2303–2328, 2006.
7. Casella George and Berger Roger L. Statistical inference. Duxbury, Pacific Grove, CA, 2nd edition, 2002.
8. Degras David, Xu Zhiwei, Zhang Ting, and Wu Wei Biao. Testing for parallelism among trends in multiple time series. IEEE Transactions on Signal Processing, 60(3):1087–1097, 2011.
9. Drineas Petros and Mahoney Michael W. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6(Dec):2153–2175, 2005.
10. Echávarri C, Aalten P, Uylings HBM, Jacobs HIL, Visser PJ, Gronenschild EHBM, Verhey FRJ, and Burgmans S. Atrophy in the parahippocampal gyrus as an early biomarker of Alzheimer's disease. Brain Structure and Function, 215(3–4):265–271, 2011.
11. Eggermont Paulus Petrus Bernardus and LaRiccia Vincent N. Maximum penalized likelihood estimation, volume II. Springer, 2001.
12. Fan Jianqing and Zhang Jian. Sieve empirical likelihood ratio tests for nonparametric functions. Annals of Statistics, 32(5):1858–1907, 2004.
13. Fan Jianqing, Zhang Chunming, and Zhang Jian. Generalized likelihood ratio statistics and Wilks phenomenon. Annals of Statistics, 29(1):153–193, 2001.
14. Filarsky Katharina, Garding Angela, Becker Natalia, Wolf Christine, Zucknick Manuela, Claus Rainer, Weichenhan Dieter, Plass Christoph, Döhner Hartmut, Stilgenbauer Stephan, Lichter Peter, and Mertens Daniel. Krüppel-like factor 4 (KLF4) inactivation in chronic lymphocytic leukemia correlates with promoter DNA-methylation and can be reversed by inhibition of Notch signaling. Haematologica, 101(6):249, 2016.
15. Giné Evarist and Nickl Richard. Mathematical foundations of infinite-dimensional statistical models. Cambridge University Press, 2015.
16. Golub Gene H, Heath Michael, and Wahba Grace. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215–223, 1979.
17. Gu Chong. Model diagnostics for smoothing spline ANOVA models. Canadian Journal of Statistics, 32(4):347–358, 2004.
18. Gu Chong. Smoothing spline ANOVA models. Springer, 2nd edition, 2013.
19. Hansen Kasper D, Langmead Benjamin, and Irizarry Rafael A. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biology, 13(10):R83, 2012.
20. Huettel Scott A, Song Allen W, and McCarthy Gregory. Functional magnetic resonance imaging. Sinauer Associates, Sunderland, 2004.
21. Ingster Yuri and Suslina Irina A. Nonparametric goodness-of-fit testing under Gaussian models. Springer Science & Business Media, 2012.
22. Ingster Yuri I. Asymptotically minimax hypothesis testing for nonparametric alternatives. I, II, III. Mathematical Methods of Statistics, 2(2):85–114, 1993.
23. Irizarry Rafael A, Ladd-Acosta Christine, Carvalho Benilton, Wu Hao, Brandenburg Sheri A, Jeddeloh Jeffrey A, Wen Bo, and Feinberg Andrew P. Comprehensive high-throughput arrays for relative methylation (CHARM). Genome Research, 18(5):780–790, 2008.
24. Kesslak J Patrick, Nalcioglu Orhan, and Cotman Carl W. Quantification of magnetic resonance scans for hippocampal and parahippocampal atrophy in Alzheimer's disease. Neurology, 41(1):51–51, 1991.
25. Kim Young-Ju and Gu Chong. Smoothing spline Gaussian regression: more scalable computation via efficient approximation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(2):337–356, 2004.
26. Lin Yi. Tensor product space ANOVA models. Annals of Statistics, 28(3):734–755, 2000.
27. Liu Anna and Wang Yuedong. Hypothesis testing in smoothing spline models. Journal of Statistical Computation and Simulation, 74(8):581–597, 2004.
28. Liu Meimei and Cheng Guang. Early stopping for nonparametric testing. In Advances in Neural Information Processing Systems, pages 3985–3994, 2018.
29. Liu Meimei, Shang Zuofeng, and Cheng Guang. Sharp theoretical analysis for nonparametric testing under random projection. In Conference on Learning Theory, pages 2175–2209, 2019.
30. Liu Qiang, Lee Jason, and Jordan Michael. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, pages 276–284, 2016.
31. Ma Ping, Zhong Wenxuan, and Liu Jun S. Identifying differentially expressed genes in time course microarray data. Statistics in Biosciences, 1(2):144, 2009.
32. Ma Ping, Mahoney Michael W, and Yu Bin. A statistical perspective on algorithmic leveraging. Journal of Machine Learning Research, 16(1):861–911, 2015.
33. Ma Siyuan and Belkin Mikhail. Diving into the shallows: a computational perspective on large-scale shallow learning. In Advances in Neural Information Processing Systems, pages 3778–3787, 2017.
34. Munk Axel and Dette Holger. Nonparametric comparison of several regression functions: exact and asymptotic theory. Annals of Statistics, 26(6):2339–2368, 1998.
35. Nichols Thomas E and Holmes Andrew P. Nonparametric permutation tests for functional neuroimaging: a primer with examples. Human Brain Mapping, 15(1):1–25, 2002.
36. Okano Masaki, Xie Shaoping, and Li En. Cloning and characterization of a family of novel mammalian DNA (cytosine-5) methyltransferases. Nature Genetics, 19(3):219, 1998.
37. Orrison William W, Lewine Jeffrey, Sanders John, and Hartshorne Michael F. Functional brain imaging. Elsevier Health Sciences, 2017.
38. Pallasch CP, Schwamb J, Königs S, Schulz A, Debey S, Kofler D, Schultze JL, Hallek M, Ultsch A, and Wendtner CM. Targeting lipid metabolism by the lipoprotein lipase inhibitor orlistat results in apoptosis of B-cell chronic lymphocytic leukemia cells. Leukemia, 22(3):585–592, 2008.
39. Pinkus Allan. N-widths in approximation theory, volume 7. Springer Science & Business Media, 2012.
40. Rami Lorena, Sala-Llonch Roser, Solé-Padullés Cristina, Fortea Juan, Olives Jaume, Lladó Albert, Peña-Gómez Cleofe, Balasa Mircea, Bosch Bea, Antonell Anna, Sanchez-Valle R, Bartrés-Faz D, and Molinuevo JL. Distinct functional activity of the precuneus and posterior cingulate cortex during encoding in the preclinical stage of Alzheimer's disease. Journal of Alzheimer's Disease, 31(3):517–526, 2012.
41. Raskutti Garvesh, Wainwright Martin J, and Yu Bin. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. Journal of Machine Learning Research, 15(1):335–366, 2014.
42. Rombouts Serge ARB, Barkhof Frederik, Goekoop Rutger, Stam Cornelis J, and Scheltens Philip. Altered resting state networks in mild cognitive impairment and mild Alzheimer's disease: an fMRI study. Human Brain Mapping, 26(4):231–239, 2005.
43. Scheff Stephen W, Price Douglas A, Ansari Mubeen A, Roberts Kelly N, Schmitt Frederick A, Ikonomovic Milos D, and Mufson Elliott J. Synaptic change in the posterior cingulate gyrus in the progression of Alzheimer's disease. Journal of Alzheimer's Disease, 43(3):1073–1090, 2015.
44. Schölkopf Bernhard, Herbrich Ralf, and Smola Alex J. A generalized representer theorem. In International Conference on Computational Learning Theory, pages 416–426. Springer, 2001.
45. Schübeler Dirk. Function and information content of DNA methylation. Nature, 517(7534):321, 2015.
46. Shang Zuofeng and Cheng Guang. Local and global asymptotic inference in smoothing spline models. Annals of Statistics, 41(5):2608–2638, 2013.
47. Shang Zuofeng and Cheng Guang. Computational limits of a distributed algorithm for smoothing spline. Journal of Machine Learning Research, 18(1):3809–3845, 2017.
48. Shen Xiaotong, Huang Hsin-Cheng, and Cressie Noel. Nonparametric hypothesis testing for a spatial signal. Journal of the American Statistical Association, 97(460):1122–1140, 2002.
49. Smith Stephen M, Jenkinson Mark, Woolrich Mark W, Beckmann Christian F, Behrens Timothy EJ, Johansen-Berg Heidi, Bannister Peter R, De Luca Marilena, Drobnjak Ivana, Flitney David E, Niazy RK, Saunders J, Vickers J, Zhang Y, De Stefano N, Brady JM, and Matthews PM. Advances in functional and structural MR image analysis and implementation as FSL. Neuroimage, 23:S208–S219, 2004.
50. Stach Dirk, Schmitz Oliver J, Stilgenbauer Stephan, Benner Axel, Döhner Hartmut, Wiessler Manfred, and Lyko Frank. Capillary electrophoretic analysis of genomic DNA methylation levels. Nucleic Acids Research, 31(2):e2, 2003.
51. Ståhle Lars and Wold Svante. Analysis of variance (ANOVA). Chemometrics and Intelligent Laboratory Systems, 6(4):259–272, 1989.
52. Storey John D, Xiao Wenzhong, Leek Jeffrey T, Tompkins Ronald G, and Davis Ronald W. Significance analysis of time course microarray experiments. Proceedings of the National Academy of Sciences, 102(36):12837–12842, 2005.
53. Subramanian Aravind, Tamayo Pablo, Mootha Vamsi K, Mukherjee Sayan, Ebert Benjamin L, Gillette Michael A, Paulovich Amanda, Pomeroy Scott L, Golub Todd R, Lander Eric S, and Mesirov Jill P. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43):15545–15550, 2005.
54. Vossoughi Mehrdad, Ayatollahi SMT, Towhidi Mina, and Heydari Seyyed Taghi. A distribution-free test of parallelism for two-sample repeated measurements. Statistical Methodology, 30:31–44, 2016.
55. Wahba Grace. Spline models for observational data. SIAM, 1990.
56. Wahba Grace, Wang Yuedong, Gu Chong, Klein Ronald, and Klein Barbara. Smoothing spline ANOVA for exponential families, with application to the Wisconsin Epidemiological Study of Diabetic Retinopathy: the 1994 Neyman memorial lecture. Annals of Statistics, 23(6):1865–1895, 1995.
57. Wang Liang, Zang Yufeng, He Yong, Liang Meng, Zhang Xinqing, Tian Lixia, Wu Tao, Jiang Tianzi, and Li Kuncheng. Changes in hippocampal connectivity in the early stages of Alzheimer's disease: evidence from resting state fMRI. Neuroimage, 31(2):496–504, 2006.
58. Wang Yazhen. Change curve estimation via wavelets. Journal of the American Statistical Association, 93(441):163–172, 1998.
59. Wang Yuedong. Smoothing splines: methods and applications. CRC Press, 2011.
60. Wei Yuting and Wainwright Martin J. The local geometry of testing in ellipses: tight control via localized Kolmogorov widths. IEEE Transactions on Information Theory, 2020.
61. Wood Simon N. Thin plate regression splines. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(1):95–114, 2003.
62. Yang Yun, Pilanci Mert, and Wainwright Martin J. Randomized sketches for kernels: fast and optimal nonparametric regression. Annals of Statistics, 45(3):991–1023, 2017.
