Abstract
Most statistical methods for microarray data analysis consider one gene at a time, and they may miss subtle changes at the single-gene level. This limitation may be overcome by considering a set of genes simultaneously, where the gene sets are derived from prior biological knowledge. We define a pathway as a predefined set of genes that serve a particular cellular or physiological function. Limited work has been done in the regression setting to study the effects of clinical covariates and of the expression levels of genes in a pathway on a continuous clinical outcome. A semiparametric regression approach for identifying pathways related to a continuous outcome was proposed by Liu et al. (2007), who demonstrated the connection between a least squares kernel machine for the nonparametric pathway effect and restricted maximum likelihood (REML) for variance components. However, the asymptotic properties of this semiparametric regression for identifying pathways have never been studied. In this paper, we study the asymptotic properties of the parameter estimates in the semiparametric regression and compare Liu et al.'s REML with our REML obtained from a profile likelihood. We prove that both approaches provide consistent estimators, attain the n−1/2 convergence rate under regularity conditions, and have either an asymptotically normal distribution or a mixture of normal distributions. However, the estimators based on our REML obtained from a profile likelihood have a theoretically smaller mean squared error than those of Liu et al.'s REML. A simulation study supports this theoretical result. A profile restricted likelihood ratio test is also provided for the non-standard testing problem. We apply our approach to a type II diabetes data set (Mootha et al., 2003).
Keywords: Gaussian random process, Kernel machine, Mixed model, Pathway analysis, Profile likelihood, Restricted maximum likelihood
1. Introduction
Numerous statistical methods have been developed for analyzing microarray data based on single genes (Efron et al., 2004; Fan and Li, 2001; Tibshirani, 1996; Zou and Hastie, 2005). However, they may not detect coordinated yet subtle changes among a set of genes where each gene only shows modest changes. One way to address this limitation of single gene based analysis is to analyze gene sets derived from prior biological knowledge to uncover patterns among the genes within a set. A number of methods and programs have been developed to consider gene groupings based on gene ontology (GO) (Harris et al., 2004). These methods have been successful in detecting subtle changes in expression levels which could be missed by single gene based analysis (Mootha et al., 2003; Hosack et al., 2003; Rajagopalan and Agarwal, 2005). In the following discussion, we consider a pathway as a predefined set of genes that serve a particular cellular or physiological function.
Limited work has been done in the regression setting to study the effects of clinical covariates and expression levels of genes in a pathway on a continuous clinical outcome. Goeman et al. (2004) proposed a global test derived from a random effects model. Liu et al. (2007) proposed a semiparametric regression model for high dimensional covariate data. Liu et al.'s approach is developed by connecting a least squares kernel machine in a nonparametric model with REML in a linear mixed model. However, the statistical properties of their approach have not been studied. In the rest of this paper, we refer to Liu et al.'s approach as LLG's approach.
The goal of our study is to establish the asymptotic properties of these estimators: consistency, convergence rate, and limiting distributions under regularity conditions. We compare LLG's estimators with ours, which are estimated using a REML obtained from a profile likelihood. We note that although our likelihood is a penalized likelihood, we simply use it as a profile likelihood. Note that there is a penalized likelihood called penalized Henderson's likelihood (Wang, 1998; Gu and Ma, 2005), which is obtained using smoothing splines.
We show that our REML estimators give more accurate score equations and information matrix, and have smaller mean squared errors (MSEs) than those of LLG’s REML. For non-standard testing problems, a profile restricted likelihood ratio test is also provided.
This paper is organized as follows. In Section 2, we give the definition of the least squares kernel machine estimator in the semiparametric model. We then explain LLG's likelihood-based approach to semiparametric regression and compare their method with our REML obtained from a profile likelihood. Section 3 describes the asymptotic properties of both estimators. In Section 4, we provide a testing procedure based on a profile restricted likelihood ratio test for the nonstandard testing problem and derive its theoretical distribution. In Section 5, we report simulation results comparing the MSEs, type I errors, and power of LLG's estimators with those of ours. In Section 6, we apply our approach to the type II diabetes data set analyzed by Mootha et al. (2003). Section 7 contains concluding remarks.
2. Semiparametric regression
A semiparametric regression model can be written as
Y = Xβ + r(Z) + ε, | (1) |
where Y is an n × 1 vector denoting the continuous outcome measured on n subjects, X is an n × q matrix representing q clinical covariates of these subjects, β is a q × 1 vector of regression coefficients for the covariate effects, Z is a p × n matrix denoting the gene expression matrix for p genes (p ≫ n), Z = [z1, …, zn], and zi is a p × 1 vector for the gene expression levels of the ith subject, r(·) is an unknown nonlinear smooth function, and ε ~ N(0, σ2I), where σ2 >0. Because of the high dimensional space of Z, we make statistical inference for the model (1) by connecting a least squares kernel machine with a Gaussian random process, where r(Z) = {r(z1), …, r(zn)} ~ N(0, τK) with r(·) following a Gaussian process (GP) with mean 0 and covariance cov{r(z), r(z′)} = τK(z, z′), τ>0, z and z′ represent two different arbitrary p × 1 vectors, K(·, ·) is a kernel function which implicitly specifies a unique function space spanned by a particular set of orthogonal basis functions, and K is an n × n matrix with the ijth component K(zi, zj), i, j = 1, …, n.
The linear effects of the clinical variables are adjusted in this model. This model reduces to the standard linear regression model when τ = 0. If τ>0, two samples i and j with similar gene expression patterns have correlated random effects r(zi) and r(zj), and therefore they have a greater probability of having similar outcomes yi and yj than samples with less similar expression patterns. This “similarity” is measured using a kernel function.
We consider the following three kernels K(·, ·) to model the covariance matrix of pathway effects:
The dth order polynomial kernel, K(z, z′) = (zTz′)d, d = 1, 2, which quantifies similarity through the inner product.
Gaussian kernel, K(z, z′) = exp(−||z−z′||2/ρ), where ||·|| denotes the Euclidean norm and ρ>0 is an unknown scale parameter. The "similarity" in this kernel is measured through the Euclidean distance between z and z′.
Neural network kernel, K(z, z′) = tanh(zTz′), which uses the hyperbolic tangent (tanh) function to quantify similarity.
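In matrix form, each choice yields the n × n kernel matrix K with entries K(zi, zj). A minimal NumPy sketch of the three kernels (function names are ours), with Z stored as the p × n expression matrix as in model (1):

```python
import numpy as np

def polynomial_kernel(Z, d=2):
    """K(z, z') = (z^T z')^d for the columns (subjects) of the p x n matrix Z."""
    G = Z.T @ Z          # n x n Gram matrix of inner products
    return G ** d

def gaussian_kernel(Z, rho=1.0):
    """K(z, z') = exp(-||z - z'||^2 / rho) with rho > 0 a scale parameter."""
    sq = np.sum(Z ** 2, axis=0)
    # squared Euclidean distances between all pairs of columns
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Z.T @ Z)
    return np.exp(-np.maximum(d2, 0.0) / rho)   # clip tiny negatives from rounding

def neural_network_kernel(Z):
    """K(z, z') = tanh(z^T z')."""
    return np.tanh(Z.T @ Z)
```

Each function returns a symmetric n × n matrix; for the Gaussian kernel the diagonal is identically one.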
By connecting a least squares kernel machine (Cristianini and Shawe-Taylor, 2006) with a restricted maximum likelihood (REML) (Zhang et al., 1998; Wang, 1998), Liu et al. (2007) estimated the nonparametric pathway effects of multiple gene expressions, r(Z), using the least squares kernel machine, estimated the covariate effects, β, using the weighted least squares estimator, and estimated the parameters of the variance components using REML. They derived the score equations and information matrix of the variance component parameters by treating the estimated β̂ as if it were the true parameter without uncertainty. However, because β̂ depends on the variance component parameters, we consider incorporating the relationship between β̂ and the variance component parameters into the REML to obtain more accurate score equations and information matrix for these parameters.
We adopt the same general approach to estimate the nonparametric function r(·). β and r(·) are estimated by maximizing the scaled penalized likelihood function with smoothing parameter λ, or equivalently by minimizing

Σi=1n {yi − xiTβ − r(zi)}2 + λ||r||2HK,

where ||·||HK represents the norm induced by the kernel K on HK, the function space generated by the kernel function K. This estimator is called the least squares kernel machine estimator (Liu et al., 2007).
By using the dual formulation (Cristianini and Shawe-Taylor, 2006; Liu et al., 2007) to reduce the high dimensional problem to a low dimensional one, the parameters β and r(·) can be estimated as

β̂ = {XT(I + λ−1K)−1X}−1XT(I + λ−1K)−1y and r̂ = λ−1K(I + λ−1K)−1(y − Xβ̂).
For estimating the variance component parameters, we start from a mixed model formulation with the assumption that the nonparametric function r(·) follows a Gaussian process (GP) with mean 0 and covariance cov{r(z), r(z′)} = τK(z, z′) = τ exp(−||z−z′||2/ρ), where ρ>0 is an unknown scale parameter. Using the marginal distribution of y ~ N(Xβ, Σ), the regression coefficient estimator β̂ can be obtained from the weighted least squares estimator β̂ = (XTΣ−1X)−1XTΣ−1Y, where Σ = σ2I+τK = Σ(Θ), Θ = (τ, ρ, σ2), and Σ(Θ) is an n × n positive definite variance matrix. The covariance of β̂ and of the nonparametric function estimator r̂(·) can be estimated as cov(β̂) = (XTΣ−1X)−1, cov(r̂) = τK−(τK)P(τK), and cov{r̂(z)} = τK(z, z)− (τKz)TP(τKz), where P = Σ−1−Σ−1X(XTΣ−1X)−1XTΣ−1 and Kz = {K(z, z1), …, K(z, zn)}T. Based on the above estimates, the variance component parameters, Θ = (τ, ρ, σ2), can be estimated through the REML, a commonly used approach for estimating variance components in the mixed effects model, and the smoothing parameter can be obtained from λ = τ−1σ2.
LLG's REML and our REML are
ℓ1(Θ) = −(1/2) log|Σ(Θ)| − (1/2) log|XTΣ(Θ)−1X| − (1/2)(y − Xβ̂)TΣ(Θ)−1(y − Xβ̂), | (2) |
ℓ2(Θ) = −(1/2) log|Σ(Θ)| − (1/2) log|XTΣ(Θ)−1X| − (1/2)yTP(Θ)y, | (3) |
where β̂ in (2) is treated as fixed and P(Θ) is defined above.
LLG calculate the score equations and information matrix of Θ = (τ, ρ, σ2) at a given β̂. Unlike LLG's approach, we consider a profile likelihood and replace β̂ by (XTΣ−1X)−1XTΣ−1y because β̂ depends on the parameters Θ. We then obtain the score equations and information matrix of Θ (see Appendix A.1). All parameters are then estimated using the Newton–Raphson method.
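The profile REML in (3) can be evaluated directly once the kernel matrix is fixed. The sketch below (function names are ours) re-derives the weighted least squares β̂ at every candidate Θ, as the profile likelihood requires, and uses a crude grid search in place of the Newton–Raphson update; ρ is held fixed inside the precomputed K for simplicity:

```python
import numpy as np

def profile_reml_negloglik(tau, sigma2, y, X, K):
    """Negative of l2(Theta) in (3): beta-hat = (X'S^-1 X)^-1 X'S^-1 y
    is profiled out, so the criterion depends only on (tau, sigma2)
    for a fixed kernel matrix K."""
    n = len(y)
    Sigma = sigma2 * np.eye(n) + tau * K
    Si = np.linalg.inv(Sigma)
    XtSiX = X.T @ Si @ X
    beta = np.linalg.solve(XtSiX, X.T @ Si @ y)   # GLS estimate, re-derived at each Theta
    resid = y - X @ beta
    ld_S = np.linalg.slogdet(Sigma)[1]
    ld_X = np.linalg.slogdet(XtSiX)[1]
    return 0.5 * (ld_S + ld_X + resid @ Si @ resid)

def fit_variance_components(y, X, K, taus, sigma2s):
    """Crude grid search standing in for the Newton-Raphson iteration."""
    grid = [(profile_reml_negloglik(t, s, y, X, K), t, s)
            for t in taus for s in sigma2s]
    _, tau_hat, s2_hat = min(grid)
    return tau_hat, s2_hat
```

In practice the Newton–Raphson method with the score equations of Appendix A.1 replaces the grid, but the objective being maximized is the same.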
3. Asymptotic properties
In this section, we provide the asymptotic properties of the estimators: consistency, convergence rate, and limiting distribution. We also show that the estimators from our REML have asymptotically smaller MSEs than those of LLG's REML.
We first give a consistency result for our estimators when the true parameter Θ0 is either in the interior or on the boundary of the parameter space Ω. For this result we need the parameter space Ω near Θ0 to behave like a closed set (Self and Liang, 1987; Vu and Zhou, 1997), i.e., the intersection of Ω and the closure of neighborhoods centered at Θ0 must constitute closed subsets. This requirement is satisfied because our Ω is a rectangle, a cross product of intervals.
Theorem 1
If Θ̂1 and Θ̂2 are estimators of Θ using REML functions (2) and (3), respectively, then Θ̂1 and Θ̂2 are both consistent estimators under regularity conditions.
Theorem 1 proves that the estimators from both approaches are consistent under regularity conditions. These regularity conditions are described in Appendix A.2. The proof of Theorem 1 is summarized in Appendix A.3.
Next, we show that both approaches attain the n−1/2 convergence rate. Before proving this rate, we first present the following two lemmas (Lemmas 1 and 2).
Lemma 1
If Σ and Σk are simultaneously diagonalizable, where Σk is the first derivative of Σ with respect to the kth component of Θ, then (1/n) tr(PΣk) uniformly converges. That is, (1/n) tr(PΣτ) and (1/n) tr(PΣσ2) uniformly converge.
Its proof is summarized in Appendix A.4. Lemma 1 implies that (1/n) tr(PΣiPΣj) uniformly converges because 0 ≤ tr(PΣiPΣj) ≤ {tr(PΣiPΣi)}1/2{tr(PΣjPΣj)}1/2 by the Cauchy–Schwarz inequality.
Lemma 2
If ΣΣk = ΣkΣ, then Σ and Σk are simultaneously diagonalizable.
It is well known that if AB = BA for two diagonalizable matrices A and B, then A and B are simultaneously diagonalizable; here Σ and Σk are symmetric and hence diagonalizable. We use this fact in the proof.
The convergence rate can be proved using (1) these two lemmas, (2) four conditions C1–C4 (see Appendix A.5), which are described in Cressie and Lahiri (1993, 1996), and (3) the condition that the intersection of Ω and the closure of neighborhoods centered about Θ0 constitute closed subsets (Vu and Zhou, 1997).
Theorem 2
Under conditions C1–C4 in Appendix A.5, Θ̂1 − Θ0 = Op(n−1/2) and Θ̂2 − Θ0 = Op(n−1/2).
In contrast to Cressie and Lahiri (1993, 1996), who showed the asymptotic properties of the REML estimator in a linear model with Gaussian errors, we consider a semiparametric model. The proof of Theorem 2 is described in Appendix A.6. Because the intersection of Ω and the closure of neighborhoods centered at Θ0 constitutes closed subsets, this result also holds when the true parameter is on the boundary of the parameter space Ω (Vu and Zhou, 1997). In our study we use the Gaussian, polynomial, and neural network kernels. We show that Σ and Στ are simultaneously diagonalizable, as are Σ and Σσ2. For the Gaussian kernel, using Lemma 2, we show that Σ and Σρ are simultaneously diagonalizable as n goes to ∞.
Theorem 3
(3.1) Suppose that the true parameters are the interior points of the parameter space. Then under regularity conditions, Θ̂1 and Θ̂2 are asymptotically normally distributed.
(3.2) Suppose that one component of the true parameters is a left endpoint of the parameter space. Then under regularity conditions, Θ̂1 and Θ̂2 are asymptotically distributed with a mixture of normal distributions.
The proof of (3.1) is explained in Appendix A.7. The proof of (3.2) can be shown by Chant’s (1974) case (i).
As for the MSEs, we compare our estimator based on our REML (3) with LLG's estimator based on LLG's REML (2). We prove in the following theorem that the estimators from our REML have asymptotically smaller MSEs than those of LLG's REML.
Theorem 4
E||Θ̂2−Θ||2 < E||Θ̂1−Θ||2 asymptotically.
Its proof is given in Appendix A.8. This result implies that our score equations and information matrix are more accurate than those of Liu et al. (2007) because we take into account the relationship between β̂ and the variance components in the REML.
Although estimators of parameters in the semiparametric regression approach developed by connecting a least squares kernel machine with REML have useful large sample properties, there are some situations in which ρ and τ are not identifiable. This result is summarized in Theorem 5.
Theorem 5
ρ and τ are not identifiable under one of the following situations: (i) τ→0; (ii) ρ→0 and τ ~ O(1/ρm) for any positive m; or (iii) 1/ρ→0.
This is because the marginal distribution of Y − Xβ reduces to N(0, σ2I) under conditions (i) and (ii) and to N(0, σ2I+τJ) under condition (iii), where J is the n × n matrix of ones, so the likelihood carries no separate information about τ and ρ. Its proof is given in Appendix A.9.
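These degeneracies are easy to see numerically for the Gaussian kernel: as ρ→0 the kernel matrix tends to the identity (so τK is absorbed into σ2I), and as 1/ρ→0 it tends to the matrix of ones J. A quick check with illustrative values (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(10, 4))                   # p = 10 genes, n = 4 subjects
sq = np.sum(Z**2, axis=0)
D2 = sq[:, None] + sq[None, :] - 2 * Z.T @ Z   # squared distances between subjects
np.fill_diagonal(D2, 0.0)                      # exact zeros on the diagonal

def gaussian_K(rho):
    return np.exp(-np.maximum(D2, 0.0) / rho)

# (ii) rho -> 0: off-diagonals vanish, K -> I, so tau*K merges with sigma^2*I
K_small = gaussian_K(1e-8)
# (iii) 1/rho -> 0: all entries tend to 1, K -> J, so tau*K -> tau*J
K_large = gaussian_K(1e8)
```

In either limit the covariance Σ = σ2I + τK depends on (τ, ρ) only through a degenerate form, matching Theorem 5.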
4. Test for the nonparametric function
Our main interest is to test H0: r(Z) = constant, where r(·) is an unknown nonlinear smooth function under model (1). Under the Gaussian random process with the Gaussian kernel, the null hypothesis H0: {r(Z) is a point mass at zero} ∪ {r(Z) has a constant covariance matrix as a function of Z} is equivalent to H0: τ/ρ = 0 or ρ = 0. For the polynomial and neural network kernels, the null hypothesis is H0: τ = 0. The proof of this equivalence is given in Appendix A.10. We perform this test using two approaches: a profile restricted likelihood ratio test, described in Section 4.1, and a permutation test, described in Section 4.2.
4.1. Profile restricted likelihood ratio test
We derive a profile restricted likelihood ratio test (PRLRT) by taking into account that the parameter of interest under the null hypothesis lies on the boundary of the parameter space and that the kernel matrix K is not a block diagonal matrix. The theoretical properties of our profile REML estimators, shown in Section 3, are needed for the PRLRT. We can perform the PRLRT using both the empirical distribution obtained by permutation and the theoretical distribution derived in the following subsections for each null hypothesis.
4.1.1. Test for H0: τ/ρ = 0 vs Ha: τ/ρ>0
The null hypothesis H0 means that either H01: τ = 0 with ρ positive and bounded, or H02: ρ = ∞ with τ positive and bounded. The PRLRT can be derived as follows.
Test for H01: τ = 0 vs τ >0, where ρ is positive and bounded. Let Ω be the parameter space for Θ = (τ, ρ, σ2)T. Denote by Ω0 and Ω1 = Ω\Ω0 the parameter spaces under H0 and Ha, respectively. The true parameters Θ0 are either in the interior or on the boundary of the parameter space Ω. Assume that the parameter spaces Ω0 and Ω1 can be approximated at Θ0 by cones CΩ0 and CΩ1, respectively, with vertex Θ0.
The PRLRT statistic, D, is the deviance, i.e., two times the difference of the log profile restricted likelihoods, D = 2l2(Θ̂)−2l2(Θ̂0), where Θ̂ and Θ̂0 maximize (3) over Ω and Ω0, respectively. By Claeskens (2004), who extended the non-standard LRT to the PRLRT setting, D converges in distribution to

infΘ̃∈C̃0 ||U−Θ̃||2 − infΘ̃∈C̃ ||U−Θ̃||2,
where C̃ = {Θ̃: Θ̃ = I(Θ0)T/2(Θ−Θ0), Θ ∈ CΩ} is the orthonormal transformation of the cone approximation, CΩ, of the parameter space Ω with Θ0 as the vertex, and C̃0 = {Θ̃: Θ̃ = I(Θ0)T/2(Θ−Θ0), Θ ∈ CΩ0} is the orthonormal transformed cone approximation of the parameter space Ω0 under the null hypothesis. U is a random vector from N(0, I), and I(Θ0)T/2 is the right Cholesky square root of the profile REML information matrix, i.e., I(Θ0) = [I(Θ0)]1/2[I(Θ0)]T/2.
Under the null hypothesis, Θ0 = (0, ρ, σ2)T and ρ is inestimable. Let Θ = (τ, σ2)T. The cone parameter spaces then reduce to CΩ = [0, ∞) × (0, ∞), CΩ0 = {0} × (0, ∞), and CΩ1 = (0, ∞)2.
Decompose the normal vector U and I(Θ0) as U = (U1, U2)T and I(Θ0) = {Ijk} corresponding to τ and σ2. After some algebra, we can show that

infΘ̃∈C̃0 ||U−Θ̃||2 = U12 and infΘ̃∈C̃ ||U−Θ̃||2 = U12·1(U1<0),

where 1(·) denotes the indicator function. Therefore, the difference between these two becomes D = U12·1(U1≥0). Since U1 ~ N(0, 1), the asymptotic distribution of D under H01 is a 50:50 mixture of a point mass at zero and a chi-square distribution with 1 degree of freedom.
Test for H02: ρ = ∞, where τ is positive and bounded. The hypothesis H02: ρ = ∞ means that 1/ρ = 0. Let ρ* = 1/ρ. In this case, Θ = (θ1, θ2), where θ1 = ρ* and θ2 = (τ, σ2). CΩ0 = {0} × (0, ∞)2 and CΩ1 = (0, ∞)3. Similar to the previous case, decompose the normal vector U and I(Θ0) as U = (U1, U2)T and I(Θ0) = {Ijk} corresponding to θ1 and θ2. We can then show that
infΘ̃∈C̃0 ||U−Θ̃||2 = U12, | (4) |
infΘ̃∈C̃ ||U−Θ̃||2 = U12·1(U1<0). | (5) |
In a similar way, the asymptotic distribution of D under H02: ρ = ∞ is a 50:50 mixture of a point mass at zero and a chi-square distribution with 1 degree of freedom.
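For a 50:50 mixture of a point mass at zero and χ21, the p-value for an observed deviance d>0 is half the χ21 tail probability, so the 5% critical value is about 2.71 rather than the naive χ21 value 3.84. A small sketch (function name is ours), using 0.5·P(χ21 ≥ d) = 1 − Φ(√d):

```python
import math

def prlrt_pvalue(D):
    """P-value under the 50:50 mixture (1/2)*chi2_0 + (1/2)*chi2_1.
    For d > 0, P(D_null >= d) = 0.5 * P(chi2_1 >= d) = 1 - Phi(sqrt(d)),
    computed via the error function; D <= 0 gives p-value 1."""
    if D <= 0:
        return 1.0
    # standard normal CDF: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return 1.0 - 0.5 * (1.0 + math.erf(math.sqrt(D) / math.sqrt(2.0)))
```

Halving the tail probability matters in practice: using the plain χ21 reference would make the test conservative and cost power.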
4.1.2. Test for H0: ρ = 0
The null hypothesis H0 means that either H03: ρ = 0 with τ positive and bounded, or H04: ρ = 0 and τ = 0.
Test for H03: ρ = 0, where τ is positive and bounded. In this case, Θ = (θ1, θ2), where θ1 = ρ and θ2 = (τ, σ2). CΩ0 = {0} × (0, ∞)2 and CΩ1 = (0, ∞)3. Similar to the previous case, decompose U and I(Θ0) as U = (U1, U2)T and I(Θ0) = {Ijk} corresponding to θ1 and θ2. In a similar way, the asymptotic distribution of D under H03 is a 50:50 mixture of a point mass at zero and a chi-square distribution with 1 degree of freedom.
Test for H04: ρ = 0 and τ = 0, where ρ < o(τ). In this case, Θ = (θ1, θ2, θ3), where θ1 = ρ, θ2 = τ, and θ3 = σ2. CΩ0 = {0}2 × (0, ∞) and CΩ1 = (0, ∞)3. Decompose the normal vector U and I(Θ0) as U = (U1, U2, U3)T and I(Θ0) = {Ijk} corresponding to θ1, θ2, and θ3. Under the orthonormal transformation, the cone spaces become C̃ = {Θ: ηθ2−θ1 ≥ 0, θ2 ≥ 0} and C̃0 = {Θ: θ2 = θ1 = 0}, where η = Ĩ12|Ĩ(Θ0)|−1/2 is the slope of the θ2 axis after transformation and Ĩ(Θ0) = {Ĩjk}, j, k = 1, 2, is the submatrix of the information matrix corresponding to (θ1, θ2). We can then show that

infΘ̃∈C̃0 ||U−Θ̃||2 = U12 + U22.
The representations of infΘ̃∈C̃ ||U−Θ̃||2 differ across the four regions of the plane with coordinates (θ1, θ2)T.
The area proportions of these four regions, (π, 1/4, 1/4, 1/2−π) in the aforementioned order, determine the probabilities that the vector U lies in each region, where π = cos−1{η · (1+η2)−1/2} = cos−1{Ĩ12 · (Ĩ11Ĩ22)−1/2}. Then the asymptotic distribution of D is the difference of the above two representations.
Because U1 and U2 are independent, the final approximate asymptotic distribution of D is a mixture of χ20, χ21, and χ22 distributions with the mixing proportions determined by the four regions above.
4.2. Permutation test
For identifying significant pathways, we also consider using the following permutation procedures.
Step 1: estimate τ/ρ, ρ, τ using the observed data by fitting the semiparametric model and calculate the residuals ε̂0i from yi = xiβ + ε0i.
Step 2: permute the residuals and simulate outcomes as yi* = xiβ̂ + ε̂0i*, where {ε̂0i*} denotes the permuted residuals.
Step 3: based on y*, x, and z, fit the semiparametric model using the likelihood-based approach and then estimate τ̂*/ρ̂*, ρ̂*, and τ̂*.
Step 4: repeat Steps 2 and 3 a large number of times, e.g., 10,000 times.
Step 5: estimate the statistical significance by the percentage of times either τ̂*/ρ̂*>τ̂/ρ̂ or ρ̂*>ρ̂, where τ̂/ρ̂, ρ̂, τ̂ are the estimated values from the observed data.
Significant pathways can be selected based on this percentage, which serves as the statistical significance level.
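Steps 1–5 can be sketched as follows (names are ours). The function `stat_fn` stands in for the semiparametric fit returning τ̂/ρ̂ (or τ̂); in the usage below we substitute a simple correlation-based statistic purely to keep the sketch self-contained:

```python
import numpy as np

def permutation_pvalue(y, X, Z, stat_fn, n_perm=1000, seed=0):
    """Steps 1-5: fit the null model y = X beta + eps by least squares,
    permute the residuals, recompute the statistic on each permuted
    outcome, and report the fraction of permuted statistics that exceed
    the observed one."""
    rng = np.random.default_rng(seed)
    beta0, *_ = np.linalg.lstsq(X, y, rcond=None)     # Step 1: null fit
    resid = y - X @ beta0                             # Step 1: residuals
    t_obs = stat_fn(y, X, Z)                          # observed statistic
    count = 0
    for _ in range(n_perm):                           # Step 4: repeat
        y_star = X @ beta0 + rng.permutation(resid)   # Step 2: permuted outcome
        if stat_fn(y_star, X, Z) > t_obs:             # Step 3: refit statistic
            count += 1
    return count / n_perm                             # Step 5: significance
```

With the real estimator plugged in as `stat_fn`, the returned fraction is the permutation significance of Step 5.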
To rank the importance of genes within a significant pathway related to clinical outcomes, we perform the following steps:
Step 6: bootstrap the samples B times. For each bootstrapped sample, b = 1, …, B, estimate the parameters by fitting the following model:
Step 7: calculate the absolute value of difference, , and , under the Gaussian kernel. For other kernels, only calculate the absolute value of difference .
Step 8: rank the importance of the genes by the mean absolute difference. If a gene plays an important role in a pathway, this difference will be large.
We can also rank the importance of gene pairs by performing the same procedure except for fitting the following model
5. Simulation studies
5.1. Mean squared error and coverage probability
We conducted simulations to compare the MSEs and coverage probabilities of LLG’s estimators with those of our estimators.
We considered the following cases and simulated 1000 data sets with two sets of n and p values. One set is (n, p) = (60, 5) with a relatively large sample size compared to the number of genes and the other set is (n, p) = (60, 200) with a relatively small sample size compared to the number of genes.
Case 1: yi = βxi + r(zi1, zi2, …, zip)+ εi with β = 1, xi = 3 cos(zi1)+2ui with ui independent of zi1 and following N(0, 1), zij ~ Uniform(0, 1), and the true r(z) ~ GP{0, τKg(ρ)}, where Kg(ρ)(z, z′) = exp(−||z−z′||2/ρ) and ||·|| denotes the Euclidean norm.
Case 2: the same setting as in Case 1 except that r(z) ~ GP(0, τKp2), where Kp2(z, z′) = (zTz′)2.
Case 3: the same setting as in Case 1 except that r(z) ~ GP(0, τKp1), where Kp1(z, z′) = zTz′.
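Case 1 can be simulated by drawing r(Z) from N(0, τK) via a Cholesky factor of the Gaussian kernel matrix. A sketch (names are ours; a small jitter is added to the kernel matrix for numerical stability of the factorization):

```python
import numpy as np

def simulate_case1(n=60, p=5, beta=1.0, tau=1.0, rho=1.0, sigma=1.0, seed=0):
    """Case 1: y_i = beta*x_i + r(z_i) + eps_i with r ~ GP{0, tau*K_g(rho)}."""
    rng = np.random.default_rng(seed)
    Z = rng.uniform(0.0, 1.0, size=(p, n))        # z_ij ~ Uniform(0, 1)
    u = rng.normal(size=n)
    x = 3.0 * np.cos(Z[0]) + 2.0 * u              # x_i = 3*cos(z_i1) + 2*u_i
    sq = np.sum(Z**2, axis=0)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z.T @ Z, 0.0)
    K = np.exp(-D2 / rho)                         # Gaussian kernel matrix
    # draw r(Z) ~ N(0, tau*K); jitter keeps the Cholesky factor stable
    L = np.linalg.cholesky(tau * K + 1e-6 * np.eye(n))
    r = L @ rng.normal(size=n)
    y = beta * x + r + rng.normal(scale=sigma, size=n)
    return y, x, Z

y, x, Z = simulate_case1()
```

Cases 2 and 3 follow by swapping in the quadratic or linear kernel for K before the Cholesky draw.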
For each case, we estimated the parameters in the semiparametric regression model using LLG's REML and our REML. We calculated the average MSEs and coverage probabilities of the parameter estimates. The results are summarized in Table 1. For both the n>p and n<p cases, our REML estimators have smaller MSEs than LLG's REML estimators. Both approaches have comparable coverage probabilities.
Table 1.
The coverage probability (cvpr) of 95% confidence intervals and average mean squared error values (mse) of parameter estimates. LLIKE=LLG’s likelihood-based approach with kernel K; PLIKE=our profile likelihood-based approach with kernel K; GK=Gaussian kernel; P2K=Quadratic kernel; P1K=Linear kernel.
| n, p | K | Method | Case | Measure | β | σ | τ | ρ |
|---|---|---|---|---|---|---|---|---|
| n=60, p=5 | GK | LLIKE | 1 | cvpr × 100 | 97 | 96 | 91 | 90 |
| | | | | mse | 0.0053 | 0.0247 | 0.0483 | 0.0623 |
| | | PLIKE | 1 | cvpr × 100 | 97 | 96 | 92 | 91 |
| | | | | mse | 0.0048 | 0.0151 | 0.0322 | 0.0458 |
| | P2K | LLIKE | 2 | cvpr × 100 | 98 | 93 | 95 | N/A |
| | | | | mse | 0.0053 | 0.0485 | 0.0252 | N/A |
| | | PLIKE | 2 | cvpr × 100 | 98 | 93 | 95 | N/A |
| | | | | mse | 0.0050 | 0.0474 | 0.0243 | N/A |
| | P1K | LLIKE | 3 | cvpr × 100 | 97 | 93 | 95 | N/A |
| | | | | mse | 0.0051 | 0.0104 | 0.0165 | N/A |
| | | PLIKE | 3 | cvpr × 100 | 97 | 93 | 95 | N/A |
| | | | | mse | 0.0042 | 0.0095 | 0.0142 | N/A |
| n=60, p=200 | GK | LLIKE | 1 | cvpr × 100 | 89 | 87 | 87 | 80 |
| | | | | mse | 0.0321 | 0.0436 | 0.0424 | 0.1261 |
| | | PLIKE | 1 | cvpr × 100 | 89 | 89 | 87 | 80 |
| | | | | mse | 0.0304 | 0.0295 | 0.0421 | 0.1120 |
| | P2K | LLIKE | 2 | cvpr × 100 | 86 | 63 | 87 | N/A |
| | | | | mse | 0.0478 | 0.4683 | 0.0388 | N/A |
| | | PLIKE | 2 | cvpr × 100 | 86 | 63 | 87 | N/A |
| | | | | mse | 0.0475 | 0.4620 | 0.0379 | N/A |
| | P1K | LLIKE | 3 | cvpr × 100 | 86 | 65 | 82 | N/A |
| | | | | mse | 0.0486 | 0.2984 | 0.0583 | N/A |
| | | PLIKE | 3 | cvpr × 100 | 86 | 66 | 85 | N/A |
| | | | | mse | 0.0480 | 0.2839 | 0.0564 | N/A |
5.2. Consistency
We also performed an additional simulation to study the accuracy of our estimates as n increases with p fixed. We considered n = 15, 30, 60, and 600: one small, two medium, and one relatively large sample size. For each case, we simulated 1000 data sets; the average MSEs are summarized in Table 2.
Table 2.
The average mean squared error values (mse) of parameter estimates for different sample sizes. LLIKE=LLG’s likelihood-based approach with kernel K; PLIKE=our profile likelihood-based approach with kernel K; GK=Gaussian kernel.
| n, p | K | Method | Case | Measure | β | σ | τ | ρ |
|---|---|---|---|---|---|---|---|---|
| n=15, p=5 | GK | LLIKE | 1 | mse | 0.0865 | 0.0774 | 0.0799 | 0.0974 |
| | | PLIKE | 1 | mse | 0.0741 | 0.0643 | 0.0673 | 0.0791 |
| n=30, p=5 | GK | LLIKE | 1 | mse | 0.0067 | 0.0319 | 0.0529 | 0.0723 |
| | | PLIKE | 1 | mse | 0.0062 | 0.0234 | 0.0436 | 0.0558 |
| n=60, p=5 | GK | LLIKE | 1 | mse | 0.0053 | 0.0247 | 0.0483 | 0.0623 |
| | | PLIKE | 1 | mse | 0.0048 | 0.0151 | 0.0322 | 0.0458 |
| n=600, p=5 | GK | LLIKE | 1 | mse | 0.0025 | 0.0114 | 0.0235 | 0.0312 |
| | | PLIKE | 1 | mse | 0.0020 | 0.0107 | 0.0197 | 0.0259 |
5.3. Type I error and power
For the assessment of type I error and power for both approaches, we considered the following Case 4 for type I error and Case 5 for power, respectively. We consider Case 5 because the nonparametric function r(zi1, zi2, …, zip) is allowed to have a complex form, with nonlinear functions of the z's and interactions among the z's, and because it allows xi and (zi1, …, zip) to be correlated. We obtained type I error and power using the profile restricted likelihood ratio test (PRLRT) described in Section 4.1. We performed the PRLRT using both the empirical distribution and the theoretical distribution, denoted "PRLRT(e)" and "PRLRT(t)", respectively. We also compared them with the permutation test described in Section 4.2, denoted "PERM".
Case 4: yi = βxi +εi with β = 1, xi = 3 cos(π/6)+ 2ui with ui ~ N(0, 1) and εi ~ N(0, σ2).
Case 5: yi = βxi +r(zi1, zi2, …, zip)+εi, with xi = 3 cos(zi1)+ 2ui, ui independent of zi1 and following N(0, 1), zij ~ Uniform(0, 1), and r(·) a complex nonlinear function of the z's with interactions among them.
The estimated type I error rates for both approaches are summarized in Tables 3–5; they are all close to the nominal level when n>p. However, when n<p, both methods deviate from the nominal level. The power of the two approaches is comparable. The performance of "PRLRT(e)" and "PRLRT(t)" was similar, as was that of "PRLRT" and "PERM".
Table 3.
Estimated type I error rate and power based on profile restricted likelihood ratio test (PRLRT) and permutation test (PERM). LLIKE=LLG’s likelihood-based approach with kernel K; PLIKE=our profile likelihood-based approach with kernel K; GK=Gaussian kernel; PRLRT(e) and PRLRT(t) are profile restricted likelihood ratio tests which are performed using empirical distribution and theoretical distribution, respectively.
| n, p | Measure | Case | LLIKE: PERM | LLIKE: PRLRT(e) | LLIKE: PRLRT(t) | PLIKE: PERM | PLIKE: PRLRT(e) | PLIKE: PRLRT(t) |
|---|---|---|---|---|---|---|---|---|
| n=60, p=5 | Type I | 4 | 0.04 | 0.05 | 0.05 | 0.04 | 0.05 | 0.05 |
| n=60, p=5 | Power | 5 | 0.99 | 0.99 | 1 | 0.99 | 0.99 | 1 |
| n=60, p=200 | Type I | 4 | 0.03 | 0.04 | 0.04 | 0.03 | 0.04 | 0.04 |
| n=60, p=200 | Power | 5 | 0.84 | 0.84 | 0.84 | 0.84 | 0.85 | 0.85 |
Table 5.
Estimated type I error rate and power based on profile restricted likelihood ratio test (PRLRT) and permutation test (PERM). LLIKE=LLG’s likelihood-based approach with kernel K; PLIKE=our profile likelihood-based approach with kernel K; P1K=Linear kernel; PRLRT(e) and PRLRT(t) are profile restricted likelihood ratio tests which are performed using empirical distribution and theoretical distribution, respectively.
| n, p | Measure | Case | LLIKE: PERM | LLIKE: PRLRT(e) | LLIKE: PRLRT(t) | PLIKE: PERM | PLIKE: PRLRT(e) | PLIKE: PRLRT(t) |
|---|---|---|---|---|---|---|---|---|
| n=60, p=5 | Type I | 4 | 0.04 | 0.05 | 0.05 | 0.04 | 0.05 | 0.05 |
| n=60, p=5 | Power | 5 | 0.97 | 1 | 1 | 0.97 | 1 | 1 |
| n=60, p=200 | Type I | 4 | 0.03 | 0.04 | 0.04 | 0.03 | 0.04 | 0.04 |
| n=60, p=200 | Power | 5 | 0.82 | 0.83 | 0.83 | 0.82 | 0.83 | 0.83 |
6. Real data analysis
6.1. Pathway based analysis for type II diabetes
We applied our profile likelihood approach to a microarray expression data set on type II diabetes (Mootha et al., 2003), in which the expression of 22,283 genes was measured in 17 male patients with normal glucose tolerance and 18 male patients with type II diabetes mellitus. Because their approach is based on a normalized Kolmogorov–Smirnov statistic, it cannot model a continuous outcome, such as glucose level, or clinical covariates, such as age. Incorporating such information into the analysis may help detect subtle differences in gene expression profiles more efficiently. We studied a total of 277 pathways consisting of 128 KEGG pathways (http://www.genome.jp/kegg/pathway.html) and 149 curated pathways. The 149 curated pathways were constructed from known biological experiments by Mootha and colleagues.
In our analysis, let Y be the log-transformed glucose level, X the age, and Z the p × n matrix of gene expression levels within each pathway, where n = 35 is the number of subjects and p is the number of genes in a specific pathway, which varied from 4 to 200 across the pathways. Our goal is to identify pathways that affect the glucose level related to diabetes after adjusting for the age effect, and to rank the genes within each significant pathway. To identify significant pathways, we fitted the semiparametric model.
6.2. Identifying significant pathways
We chose a 0.05 cutoff for statistical significance using both the theoretical and empirical distributions of the PRLRT and the percentage described in Section 4.2. We also applied existing multiple comparison methods (Storey, 2002, 2003) to our pathway data, although our pathways are not independent of one another because of shared genes and interactions among pathways. The FDR q-values were between 0.081 and 0.303. Our approach, using both theoretical and empirical distributions, took about 60 min to run on a Mac Pro with two 3.0 GHz Quad-Core Intel Xeon processors and 10 GB of memory. Our code is written in Matlab and is available upon request.
To find pathways shared across the four kernels, we selected the top 50 pathways for each kernel and then examined the pathways common to all four. A total of seven pathways were common to the four kernels, including the Alanine and aspartate metabolism, Oxidative phosphorylation, and RNA polymerase pathways. Since one of them is a subset of another pathway, six pathways are summarized in Table 6.
Table 6.
Pathways significant across all four kernels (Linear, Quadratic, Gaussian, Neural network) using the type II diabetes pathway data; P1K=linear kernel; P2K=quadratic kernel; GK=Gaussian kernel; NNK=neural network kernel; P-values are obtained using both the profile restricted likelihood ratio test (PRLRT) and the permutation test (PERM); PRLRT(t) is the profile restricted likelihood ratio test performed using the theoretical distribution.
| Pathway ID | Pathway name (# of genes) | P1K PERM | P1K PRLRT(t) | P2K PERM | P2K PRLRT(t) | GK PERM | GK PRLRT(t) | NNK PERM | NNK PRLRT(t) |
|---|---|---|---|---|---|---|---|---|---|
| 4 | Alanine and aspartate metabolism (18) | 0.009 | 0.008 | 0.003 | 0.002 | 0.034 | 0.030 | 0.016 | 0.014 |
| 36 | c17_U133_probes (116) | 0.029 | 0.031 | 0.002 | 0.001 | 0.003 | 0.002 | 0.015 | 0.013 |
| 133 | MAP00190_Oxidative_phosphorylation (58) | 0.025 | 0.027 | 0.007 | 0.007 | 0.032 | 0.029 | 0.022 | 0.025 |
| 229 | Oxidative_phosphorylation (113) | 0.023 | 0.021 | 0.008 | 0.009 | 0.035 | 0.039 | 0.024 | 0.022 |
| 209 | MAP03020_RNA_polymerase (21) | 0.034 | 0.031 | 0.027 | 0.026 | 0.040 | 0.041 | 0.025 | 0.026 |
| 254 | RNA polymerase (25) | 0.037 | 0.038 | 0.028 | 0.023 | 0.041 | 0.043 | 0.022 | 0.021 |
Pathway 4 is the Alanine and aspartate metabolism pathway, which has been studied for its association with abnormal hepatocellular function and abnormal fasting glucose levels in type II diabetes (Jiamjarasrangsi et al., 2009). Two of the six pathways, pathways 133 and 229, where all but one gene in pathway 133 belong to pathway 229, are related to oxidative phosphorylation, which is known to be associated with diabetes (Misu et al., 2007; Mootha et al., 2003, 2004). Oxidative phosphorylation is a process of cellular respiration in humans (and in eukaryotes generally). These pathways contain genes coregulated across different tissues and are related to insulin/glucose disposal. They include ATP synthesis, a pathway involved in energy transfer. Two other pathways, pathways 209 and 254, are related to RNA polymerase; all but two genes in pathway 209 are part of pathway 254. Among all the pathways, pathway 36, c17_U133_probes, is the most significant under the Gaussian kernel. This pathway plays a role in cellular behavioral changes (Saxena, 2001) and is also one of the seven pathways common among the four kernels. It contains several genes related to human insulin signaling, e.g. CAP1, MAPP2K6, ARF6, and SGK (Dahlquist et al., 2002). Only one gene in pathway 36 is included in pathway 254. These genes were not significant using single-gene-based analysis. We also note that no gene in the Oxidative phosphorylation pathway was significant using single-gene-based analysis, and only one gene, GAD2, in pathway 4 was significant using single-gene-based analysis.
We compared the top 50 pathways identified by the global test (Goeman et al., 2004) and by GSEA (Subramanian et al., 2005). The global test is based on a random effects model that does not include covariates and uses a linear kernel. GSEA calculates an enrichment score, which is a weighted function of the correlation among genes in a pathway; this enrichment score cannot incorporate covariate information. The global test and GSEA results are summarized in the supplementary materials.
We calculated the proportions of overlap among the global test, GSEA, and our approach. The proportion of overlap between the global test and GSEA was 0.36. The largest proportion between GSEA and our approach was 0.41, obtained with the quadratic kernel. These proportions of overlap among different methods were small, meaning that they detected different pathways. When our approach was used with the four different kernels, the largest overlap was 0.92, between the linear and quadratic kernels. The global test and GSEA found pathways 4, 140, and 229 to be significant, but they could not detect pathways 36, 133, and 254, whereas our approach detected all of them.
For kernel selection, we used the Akaike information criterion (AIC) (Akaike, 1974) and the Bayesian information criterion (BIC) (Schwarz, 1978), where AIC = n log{(Y−Ŷ)T(Y−Ŷ)} + 2r, BIC = n log{(Y−Ŷ)T(Y−Ŷ)} + r log(n), Ŷ = LY, L = (I+λ−1K)−1[λ−1K + X{XT(I+λ−1K)−1X}−1XT(I+λ−1K)−1], and r = rank(L).
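These quantities can be computed directly from K, X, Y, and λ. The sketch below (our own Python with our own variable names, mirroring the formulas above, not the authors' Matlab code) does so; a useful sanity check is that the smoother matrix L reproduces the covariates exactly, LX = X:

```python
import numpy as np

def smoother_matrix(X, K, lam):
    # L = (I + lam^{-1}K)^{-1} [lam^{-1}K
    #      + X {X'(I + lam^{-1}K)^{-1} X}^{-1} X'(I + lam^{-1}K)^{-1}]
    n = X.shape[0]
    A = np.linalg.inv(np.eye(n) + K / lam)
    M = np.linalg.inv(X.T @ A @ X)
    return A @ (K / lam + X @ M @ X.T @ A)

def aic_bic(Y, X, K, lam):
    # AIC = n log(RSS) + 2r and BIC = n log(RSS) + r log(n), r = rank(L)
    n = len(Y)
    L = smoother_matrix(X, K, lam)
    r = np.linalg.matrix_rank(L)
    resid = Y - L @ Y
    rss = float(resid @ resid)
    return n * np.log(rss) + 2 * r, n * np.log(rss) + r * np.log(n)

# toy data: n = 35 subjects, intercept plus one covariate, linear kernel
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(35), rng.standard_normal(35)])
Z = rng.standard_normal((35, 5))
K = Z @ Z.T
Y = rng.standard_normal(35)
aic, bic = aic_bic(Y, X, K, lam=2.0)
```

The identity LX = X follows algebraically from the definition of L, since the inner {X′(I+λ⁻¹K)⁻¹X}⁻¹ term makes the fixed-effect part a projection.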
We found that all seven pathways are in the top-15 list of smallest AIC and BIC values for every kernel. Pathway 36, c17_U133_probes, has the smallest AIC and BIC values with the Gaussian kernel among the four kernels, whereas pathway 229, Oxidative phosphorylation, has the smallest values with the quadratic kernel, and pathway 254, RNA polymerase, has the smallest values with the neural network kernel. Pathways 4, 36, and 229 have small values among all pathways with the quadratic kernel.
7. Discussion
In this paper, we have derived the asymptotic properties of semiparametric regression for identifying pathway effects on clinical outcomes after controlling for covariates. We compared LLG’s REML with our REML obtained from a profile likelihood. We showed that the estimators obtained from both REMLs are consistent, have a √n convergence rate, and asymptotically follow a normal distribution when the true parameters are interior points of the parameter space, and a mixture of normal distributions when one component of the true parameters is a left endpoint of the parameter space. However, our REML gives more accurate score equations and information matrix, and has smaller MSEs than LLG’s REML. A profile restricted likelihood ratio test is also provided for the non-standard testing problem in our application.
The choice of an appropriate kernel poses a significant practical issue. We used kernel selection approaches based on AIC and BIC (Liu et al., 2007). Kim et al. (2012) proposed a Bayes factor-based approach to kernel selection; however, it is computationally expensive. Kernel selection can be viewed as a model selection problem within the kernel machine framework. More general and flexible model selection methods for covariance matrix estimation may be explored here and are worth future research.
We note that we analyze each pathway separately. Pathways are known not to be independent of each other because of shared genes and interactions among pathways, which makes it difficult to adjust p-values. Because existing multiple comparison methods based on false discovery rates (Benjamini and Hochberg, 1995; Storey, 2002, 2003) were developed for single-gene-based analysis under an independence assumption or a known positive dependence structure among genes, they are not applicable in a pathway-based analysis, where pathways are not independent of each other. Developing such a multiple comparison method for pathway-based analysis will be a challenging problem because of the complex dependence structure among pathways.
It is also important to generalize the semiparametric model (Hastie and Tibshirani, 1990; Tibshirani, 1996) to incorporate multiple pathways, for example using generalized additive models and multivariate adaptive regression splines (Friedman, 1991).
Supplementary Material
Table 4.
Estimated type I error rate and power based on profile restricted likelihood ratio test (PRLRT) and permutation test (PERM). LLIKE=LLG’s likelihood-based approach with kernel K; PLIKE=our profile likelihood-based approach with kernel K; P2K=Quadratic kernel; PRLRT(e) and PRLRT(t) are profile restricted likelihood ratio tests which are performed using empirical distribution and theoretical distribution, respectively.
| Sample size and number of genes | Type I error / power | Case | LLIKE PERM | LLIKE PRLRT(e) | LLIKE PRLRT(t) | PLIKE PERM | PLIKE PRLRT(e) | PLIKE PRLRT(t) |
|---|---|---|---|---|---|---|---|---|
| n = 60 | Type I | 4 | 0.04 | 0.05 | 0.05 | 0.04 | 0.05 | 0.05 |
| p = 5 | Power | 5 | 0.97 | 1 | 1 | 0.98 | 1 | 1 |
| n = 60 | Type I | 4 | 0.03 | 0.04 | 0.04 | 0.03 | 0.04 | 0.04 |
| p = 200 | Power | 5 | 0.83 | 0.85 | 0.85 | 0.84 | 0.85 | 0.85 |
Acknowledgments
This study was supported in part by NIH Grants GM-59507, N01-HV-28186 and P30-DA-18343, National Science Foundation Grant DMS 1106738, and a pilot grant from the Yale Pepper Center P30AG021342.
Appendix A. Technical complements
A.1. Score equations and information matrix
We introduce the following notation to simplify our discussion:
where Kρ(·, ·) and Kρρ(·, ·) are the first and second derivatives of K(·, ·) with respect to ρ. Let K, Kρ, and Kρρ be the n × n matrices whose entries are K(·, ·), Kρ(·, ·), and Kρρ(·, ·), respectively. We recall the following notation:
Based on these notations and the calculations of
we obtain the first and second derivatives of Σ and Σ−1 with respect to τ as follows: Στ = K, Σττ = 0 × I, ∂Σ−1/∂τ = −Σ−1KΣ−1, and ∂2Σ−1/∂τ2 = 2Σ−1KΣ−1KΣ−1.
We also calculate the first and second derivatives of Σ and Σ−1 with respect to ρ as follows: Σρ = τKρ, Σρρ = τKρρ, ∂Σ−1/∂ρ = −τΣ−1KρΣ−1, and ∂2Σ−1/∂ρ2 = 2τ2Σ−1KρΣ−1KρΣ−1 − τΣ−1KρρΣ−1.
The first and second derivatives of Σ and Σ−1 with respect to σ2 are Σσ2 = I, Σσ2σ2 = 0 × I, ∂Σ−1/∂σ2 = −Σ−2, and ∂2Σ−1/∂(σ2)2 = 2Σ−3.
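These derivative formulas all rest on the standard matrix identity ∂Σ−1/∂θ = −Σ−1Σθ Σ−1. As a sanity check, the identity for τ can be compared against a central finite difference (a self-contained numerical sketch with arbitrary test matrices, not part of the original derivation):

```python
import numpy as np

# Finite-difference check of d(Sigma^{-1})/d tau = -Sigma^{-1} K Sigma^{-1}
# for Sigma = tau*K + sigma2*I (arbitrary test values, illustration only)
rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
K = B @ B.T                                   # symmetric PSD test "kernel"
tau, sigma2, eps = 0.7, 0.5, 1e-6

def sigma_inv(t):
    return np.linalg.inv(t * K + sigma2 * np.eye(6))

numeric = (sigma_inv(tau + eps) - sigma_inv(tau - eps)) / (2 * eps)
analytic = -sigma_inv(tau) @ K @ sigma_inv(tau)
```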
We obtain the second derivatives of Σ and Σ−1 with respect to τ and ρ, τ and σ2, and ρ and σ2 as follows:
The first derivatives of XTΣX with respect to τ, ρ, σ2 are
The second derivatives of (XΣX)−1 with respect to τ, ρ, σ2 are
The first and second derivatives of H with respect to τ are
The first and second derivatives of H with respect to ρ are
The first and second derivatives of H with respect to σ2 are
The second derivatives of H with respect to τ and ρ, τ and σ2, ρ and σ2 are
The first derivatives of P with respect to τ, ρ, and σ2 follow from the identity ∂P/∂θ = −PΣθP, namely Pτ = −PKP, Pρ = −τPKρP, and Pσ2 = −P2.
The score equations based on our REML l2(Θ) are
The second derivatives of l2(Θ) with respect to τ, ρ, and σ2 are
The information matrix based on our REML l2(Θ) is
Also the score equations and information matrix of Liu et al. (2007) are
and
A.2. Regularity conditions
RC1
The observations D = (X, Z, Y) have probability density f(D, Θ) with respect to some measure μ and parameter Θ = (θ1, …, θk). f(D, Θ) has a common support and the model is identifiable. Furthermore, the first and second derivatives of log(f) satisfy EΘ[∂ log f(D, Θ)/∂θj] = 0 for j = 1, …, k,
and EΘ[∂2 log f(D, Θ)/∂θj∂θl] = −EΘ[{∂ log f(D, Θ)/∂θj}{∂ log f(D, Θ)/∂θl}] for j, l = 1, …, k.
RC2
The Fisher information matrix I(Θ) = EΘ[{∂ log f(D, Θ)/∂Θ}{∂ log f(D, Θ)/∂Θ}T]
is finite and positive definite at Θ = Θ0, where Θ0 is the true parameter value.
RC3
There exists an open subset ω of Ω containing the true parameter vector Θ0 such that for almost all D the density f(D, Θ) admits all third derivatives ∂3 log f(D, Θ)/∂θi∂θj∂θl for Θ ∈ ω.
Furthermore, there exist functions Mijl such that |∂3 log f(D, Θ)/∂θi∂θj∂θl| ≤ Mijl(D) for all Θ ∈ ω,
where mijl = EΘ0[Mijl(D)] < ∞ for all i, j, l.
RC4
Near Θ0 the parameter space Ω behaves like a closed set; this condition was used by Self and Liang (1987) and Vu and Zhou (1997).
A.3. Proof of Theorem 1
Recall that LLG’s REML and our REML are denoted as l1(·) and l2(·), and Θ̂1 and Θ̂2 denote LLG’s estimator and our estimator, respectively. In the following derivation, we use l(·) and Θ̂ to denote the loglikelihood and estimator of either approach to simplify discussion. The true parameter value is denoted by Θ0.
Let αn = n−1/2 + an, where
We want to show that for any ε > 0 there exists a constant C such that
P{sup||u||=C l(Θ0 + αnu) < l(Θ0)} ≥ 1 − ε. | (A.1) |
This implies that, with probability at least 1−ε, there exists a local maximum in the ball {Θ0 + αnu: ||u|| ≤ C}. Hence, there exists a local maximizer Θ̂ such that ||Θ̂ − Θ0|| = Op(αn). Since the intersection of the parameter space Ω and the closure of a neighborhood about Θ0 consists of closed intervals, a local maximizer also exists on this set even when Θ0 is on the boundary of the parameter space Ω. Let L′(Θ0) be the gradient vector of the loglikelihood function L. By the standard argument based on a Taylor expansion of the likelihood function, and provided an = o(1), we have
| (A.2) |
Note that n−1/2L′(Θ0) = Op(1). Thus, the first term on the right-hand side of (A.2) is of order Op(n1/2αn) = Op(nαn2). By choosing a sufficiently large C, the second term dominates the first term uniformly in ||u|| = C, and the third term is also dominated by the second term. Hence, by choosing a sufficiently large C, (A.1) holds when Θ0 is in the interior of Ω as well as when Θ0 is on the boundary of Ω. This completes the proof of (1.1) in Theorem 1.
A.4. Proof of Lemma1
Since Σ(Θ) is a symmetric matrix, we can express it as Σ(Θ) = Σi=1n λin(Θ) ein einT,
where λin(Θ) denotes the ordered eigenvalues of Σ(Θ), ein the corresponding orthonormal eigenvectors, and Θ = (θ1, …, θk). For ease of notation, we let λin(Θ) ≡ λin and Σ(Θ) ≡ Σ. We define
Since
we have
Hence, the mth diagonal element of PΣk is
Therefore, we have
Since
are the same as n goes to ∞,
by the law of large numbers.
A.5. Conditions C1–C4
Cressie and Lahiri (1993, 1996) showed that a general result for the asymptotic property of REML estimator Θ̂reml in a parametric linear model with Gaussian error, Y ~ Nn(Xβ, Σ(Θ)), holds under conditions C1–C4, which we describe later in this section, where Y is an n × 1 data vector (Y1, …, Yn)T, X is an n × q matrix of explanatory variables, β is a q × 1 vector of unknown large scale effects, and Σ(Θ) is an n × n positive definite variance matrix which is known up to a k × 1 vector of small scale effects Θ = (θ1, …, θk). In our studies, Σ(Θ) = σ2I+τK is a positive definite matrix. When we use the Gaussian kernel, Θ = (τ, ρ, σ2) and k = 3. For the polynomial and neural network kernels, Θ = (τ, σ2) and k = 2.
Let U = Γ′Y represent a vector of n−s linearly independent error contrasts; that is, the n−s columns of Γ are linearly independent and U ~ N(0, Γ′Σ(Θ)Γ). If a set of n−s linearly independent contrasts is used to define U, the new negative loglikelihood function is, up to an additive constant,
LU(Θ) = (1/2) log|Σ(Θ)| + (1/2) log|XTΣ(Θ)−1X| + (1/2) YTPY,
where P = Σ−1 − Σ−1X(XTΣ−1X)−1XTΣ−1. A REML estimator of Θ can be obtained by minimizing this function. Let φ(Θ) = (∂2LU(Θ)/∂θi∂θj), the k × k matrix of second-order partial derivatives of the negative loglikelihood function LU(·). Then the (i, j)th element of E{φ(Θ)} is tr{PΣi(Θ)PΣj(Θ)}/2. Under conditions C1–C4, Cressie and Lahiri (1993, 1996) showed that
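A numerical sketch of this REML criterion (our own Python, assuming the standard form of the restricted loglikelihood and simulated data, not the authors' code or the diabetes data) evaluates the criterion through the projection matrix P defined above, and checks the key REML property that the criterion is invariant to the fixed-effect part of Y, since PX = 0:

```python
import numpy as np

def neg_reml(Y, X, Sigma):
    # negative REML loglikelihood, up to an additive constant, for
    # Y ~ N(X beta, Sigma), via P = S^{-1} - S^{-1}X(X'S^{-1}X)^{-1}X'S^{-1}
    Si = np.linalg.inv(Sigma)
    P = Si - Si @ X @ np.linalg.inv(X.T @ Si @ X) @ X.T @ Si
    _, logdetS = np.linalg.slogdet(Sigma)
    _, logdetX = np.linalg.slogdet(X.T @ Si @ X)
    return 0.5 * (logdetS + logdetX + Y @ P @ Y)

# grid-search REML estimate of tau with sigma^2 fixed, Sigma = tau*K + I
rng = np.random.default_rng(2)
n = 40
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
Z = rng.standard_normal((n, 5))
K = Z @ Z.T
Y = X @ np.array([1.0, 0.5]) + rng.multivariate_normal(
    np.zeros(n), 2.0 * K + np.eye(n))
grid = np.linspace(0.1, 5.0, 50)
tau_hat = min(grid, key=lambda t: neg_reml(Y, X, t * K + np.eye(n)))
```

The invariance to β is exactly what the error-contrast construction U = Γ′Y guarantees.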
In our semiparametric regression setting, we show that their result holds under certain situations which are described in Lemmas 1 and 2. By using Lemmas 1 and 2 of our paper and the result of Cressie and Lahiri (1993, 1996), there exists a c̄ such that (1/n) tr(PΣi(θ)PΣj(θ)/2)→uc̄, where →u denotes uniform convergence. Therefore, we can show that
The conditions needed for applying the result of Cressie and Lahiri (1993, 1996) are
C1. Σ(Θ) is twice continuously differentiable on Θ.
C2. For all c > 0, η > 0, let ||B|| = {tr(B′B)}1/2, with
C3. There exists a positive definite matrix W(Θ), continuous in Θ, such that Qn(Θ) converges uniformly to W(Θ), where
C4. Let |λ1n(Θ)| ≤ · · · ≤ |λnn(Θ)| denote the ordered absolute eigenvalues of Σ(Θ), and denote those of Σij = ∂2Σ(Θ)/∂Θi∂Θj analogously. Suppose there is a sequence (rn)n≥1 with lim supn→∞ rn/n ≤ 1−δ*, for some δ* ∈ (0, 1), such that, for any compact subset Ωs ⊆ Ω, there exist constants 0 < C1(Ωs) < ∞ and η1(Ωs) > 0 such that
uniformly in Θ ∈ Ωs.
A.6. Proof of Theorem 2
We check the validity of assumptions (C1), (C2), (C3), and (C4) as follows:
Conditions (C1) and (C2.ii) require smoothness of the variance–covariance matrix Σ(Θ) as a function of Θ. Since our Σ(Θ) = τK(ρ) + σ2I is a smooth function of Θ, conditions (C1) and (C2.ii) hold.
Since {EΘ[φn(Θ)]1/2}n≥1 is smooth, which guarantees (C2.i), condition (C2.i) holds by the arguments of Cressie and Lahiri (1996).
C3. We show that condition (C3) holds under the situations described in Lemmas 1 and 2. When we use the Gaussian, polynomial, and neural network kernels, Σ and Στ are simultaneously diagonalizable, as are Σ and Σσ2. For the Gaussian kernel, we also need to show that Σ and Σρ are simultaneously diagonalizable as n goes to ∞. We can show this using Lemma 2 as follows.
The ijth elements of the matrices KKρ and KρK are
Since Ez(KKρ − KρK) = 0 as n→∞, we have KKρ = KρK as n→∞. Therefore, K and Kρ are almost simultaneously diagonalizable for large n, and hence Σ and Σρ are almost simultaneously diagonalizable for large n. From Lemma 1, (1/n) tr(PΣρ) converges. Therefore, condition (C3) holds if Σ and Σk are simultaneously diagonalizable.
C4. Since Σ(Θ) and Σi(Θ) are nonsingular and uniformly bounded over a compact subset of Ω, this condition holds.
This completes the proof of Theorem 2 when Θ0 is an interior point of the parameter space Ω.
Since the intersection of the parameter space Ω and the closure of a neighborhood about Θ0 consists of closed intervals, Theorem 2 also holds when Θ0 is on the boundary of Ω by the arguments of Geyer (1994) and Vu and Zhou (1997), who extended the result of Chernoff (1954) to nonidentically distributed sampling. Geyer’s (1994) result is based on a sampling model that is essentially a stationary process, while Vu and Zhou (1997) have no such restriction and allow general nonidentically distributed sampling, so that models with covariances can be included.
We can demonstrate that our results hold using Vu and Zhou (1997) by relating their conditions to ours. Vu and Zhou (1997) established the existence, consistency, and asymptotic properties of local maximum estimators for a large class of estimation problems that allow sampling from nonidentically distributed random variables, under conditions A1–A2 and B1–B4 of their paper: regularity condition RC3 of our paper implies A1 in Vu and Zhou (1997); RC4 implies A2; RC1 and RC2 imply B1; RC2 implies B2; and condition C2 in A.5 of our paper implies B3 and B4 in Vu and Zhou (1997). Therefore, under the regularity conditions, Theorem 2 also holds when Θ0 is on the boundary of the parameter space Ω.
We can also show our results using Geyer (1994), but we need to add his Assumption D. Geyer (1994) assumed that the sampling distribution is essentially a stationary process and established the existence, consistency, and asymptotic properties of local maximum estimators under Assumptions A–D: since our density function f(D, Θ) is in C3 in Θ, Assumption A in Geyer (1994) is satisfied; RC3 implies Assumption B in Geyer (1994); Assumption C in Geyer (1994) is
for some covariance matrix A, and is satisfied by the weak law of large numbers and the central limit theorem; Assumption D requires that the estimating sequence satisfies Θ̂n = Θ0 + op(1). Therefore, Theorem 2 also holds under the regularity conditions.
A.7. Proof of Theorem 3
For a consistent estimator Θ̂, a first-order Taylor expansion of l(·) near Θ0 yields
where Θ̄ is a vector between Θ̂ and Θ0. Consequently, we have
Next, we show that, (i) the limit distribution of
and (ii)
as n→∞
To prove (i), note that
The first term converges in distribution to N(0, I(Θ0)); the second term goes to zero at rate o(n−1/2) because (XTΣ−1X)−1(XTΣ−1ΣΘΣ−1X) = o(1). Thus, (i) holds.
For (ii), since
and again the second term on the right-hand side goes to zero, while the first term converges in probability to I(Θ0) (Lehmann, 1983), so (ii) holds.
Thus, applying Slutsky’s lemma and converting back to the original parameter space via the delta method, the asymptotic normality holds. This proves (3.1).
The proof of (3.2) follows from case (i) of Chant (1974). Suppose that Ω = Ω1 × Ω2 × Ω3, Θ1 is a left endpoint of Ω1, and the other components of Θ are interior points of Ωi, i = 2, 3. Then Θ̂ can be expressed as
where N is a random variable with a multivariate Gaussian distribution with mean Θ and covariance I−1(Θ0), where Θ is restricted to lie in CΩ–Θ0 and CΩ is a cone with vertex at Θ0. This expression corresponds to the result of Chant’s (1974) case (i).
A.8. Proof of Theorem 4
If Θ̂ is consistent then by a Taylor series approximation
Therefore,
where Θ̄ is a vector between Θ̂ and Θ0, li.Θ = ∂li(Θ)/∂ΘT, and li.ΘΘ = ∂li.Θ(Θ)/∂ΘT, i = 1, 2.
By applying the chain rule, we obtain the first and second derivatives of the REML with respect to Θ as follows:
and
Since ||l2.ΘΘ|| > ||l1.ΘΘ||, we obtain the following inequality:
Therefore, E||Θ̂1 − Θ||2 > E||Θ̂2 − Θ||2. That is, the estimators based on our REML have a theoretically smaller mean squared error than those based on LLG’s REML. This completes the proof of Theorem 4.
A.9. Non-identifiability between τ and ρ when ρ→0
If τ ~ O(1/ρm) for any positive value m and ρ ~ O{E(||z−z′||2)}, then Θ̂2 is asymptotically normally distributed with mean Θ and covariance matrix I−1(Θ). But if ρ→0, so that Hτ = Hττ = Pτ = Pττ = 0 and Hρ = Hρρ = Pρ = Pρρ = 0, then the asymptotic distributions of τ̂2 and ρ̂2 coincide and are degenerate.
For LLG’s estimator Θ̂1, as ρ→0, the asymptotic distributions of τ̂1 and ρ̂1 are also the same because the information matrix of Liu et al. (2007) is
and Σθl → 0.
A.10. Proof of the equivalence of the two tests
The test of H0: {r(Z) is a point mass at zero} ∪ {r(Z) has a constant covariance matrix as a function of z} is equivalent to the test of ∂K(Z)/∂Z = 0.
If ρ→0 and τ→0 at the faster rate O(ρm), then ∂K(Z)/∂Z = 0. That is, if τ/ρ→0, then ∂K(Z)/∂Z = 0.
If ρ→∞, then 0 ≤ exp(−||zi−zj||2/ρ) ≤ 1. Therefore,
Hence, if
If ρ→0 and τ ~ O(1/ρm), then exp(−||zi−zj||2/ρ)→0, and therefore ∂K(Z)/∂Z = 0. In summary, if τ/ρ→0 or ρ→0, then ∂K(Z)/∂Z = 0.
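The endpoint behavior used in this argument can be illustrated numerically for the Gaussian kernel: as ρ→∞ the Gram matrix degenerates to the all-ones matrix (confounded with an intercept), and as ρ→0 it degenerates to the identity (confounded with σ2I). This is our own illustrative sketch with simulated points:

```python
import numpy as np

# Endpoint behavior of the Gaussian kernel K_ij = exp(-||z_i - z_j||^2 / rho)
rng = np.random.default_rng(3)
Z = rng.standard_normal((8, 4))
sq = np.sum(Z**2, axis=1)
d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * Z @ Z.T, 0.0)

K_big = np.exp(-d2 / 1e8)      # rho -> infinity: tends to the all-ones matrix
K_small = np.exp(-d2 / 1e-8)   # rho -> 0: tends to the identity matrix
```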
Appendix B. Supplementary material
Some additional results and information on pathway and genes are available in a separate file for supplementary materials.
Supplementary data associated with this paper can be found in the online version at http://dx.doi.org/10.1016/j.jspi.2012.09.009.
References
- Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control. 1974;19:716–723.
- Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B. 1995;57:289–300.
- Chant D. On asymptotic tests of composite hypotheses in nonstandard conditions. Biometrika. 1974;61:291–298.
- Chernoff H. On the distribution of the likelihood ratio. Annals of Mathematical Statistics. 1954;25:573–578.
- Claeskens G. Restricted likelihood ratio lack of fit tests using mixed spline models. Journal of the Royal Statistical Society, Series B. 2004;66:909–926.
- Cressie N, Lahiri SN. The asymptotic distribution of REML estimators. Journal of Multivariate Analysis. 1993;45:217–233.
- Cressie N, Lahiri SN. Asymptotics for REML estimation of spatial covariance parameters. Journal of Statistical Planning and Inference. 1996;50:327–341.
- Cristianini N, Shawe-Taylor J. Kernel Methods for Pattern Analysis. Cambridge University Press; Cambridge: 2006.
- Dahlquist KD, Salomonis N, Vranizan K, Lawlor SC, Conklin BR. GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nature Genetics. 2002;31:19–20. doi: 10.1038/ng0502-19.
- Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Annals of Statistics. 2004;32:407–499.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Friedman JH. Multivariate adaptive regression splines (with discussion). The Annals of Statistics. 1991;19:1–141.
- Geyer CJ. On the asymptotics of constrained M-estimation. The Annals of Statistics. 1994;22:1993–2010.
- Goeman J, van de Geer SA, Kort F, Houwelingen HC. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics. 2004;20:93–99. doi: 10.1093/bioinformatics/btg382.
- Gu C, Ma P. Optimal smoothing in nonparametric mixed-effect models. Annals of Statistics. 2005;33:1357–1379.
- Harris MA, et al. The gene ontology (GO) database and informatics resource. Nucleic Acids Research. 2004;32:D258–D261. doi: 10.1093/nar/gkh036.
- Hastie TJ, Tibshirani RJ. Generalized Additive Models. Chapman & Hall/CRC; 1990.
- Hosack DA, Dennis G Jr, Sherman BT, Clifford H, Lempicki RA. Identifying biological themes within lists of genes with EASE. Genome Biology. 2003;4(10):R70. doi: 10.1186/gb-2003-4-10-r70.
- Jiamjarasrangsi W, Lertmaharit S, Sangwatanaroj S, Lohsoonthorn V. Type 2 diabetes, impaired fasting glucose, and their association with increased hepatic enzyme levels among the employees in a University Hospital in Thailand. Journal of the Medical Association of Thailand. 2009;92:961–968.
- Kim I, Pang H, Zhao H. Bayesian semiparametric regression models for evaluating pathway effects on clinical continuous and binary outcomes. Statistics in Medicine. 2012;31:1633–1651. doi: 10.1002/sim.4493.
- Lehmann EL. Theory of Point Estimation. John Wiley; New York: 1983.
- Liu D, Lin X, Ghosh D. Semiparametric regression of multi-dimensional genetic pathway data: least squares kernel machines and linear mixed models. Biometrics. 2007;63:1079–1088. doi: 10.1111/j.1541-0420.2007.00799.x.
- Misu H, Takamura T, Matsuzawa N, Shimizu A, Ota T, Sakurai M, Ando H, Arai K, Yamashita T, Honda M, Yamashita T, Kaneko S. Genes involved in oxidative phosphorylation are coordinately upregulated with fasting hyperglycaemia in livers of patients with type 2 diabetes. Diabetologia. 2007;50:268–277. doi: 10.1007/s00125-006-0489-8.
- Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson P, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop L. PGC-1 alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics. 2003;34:267–273. doi: 10.1038/ng1180.
- Mootha VK, Handschin C, Arlow D, Xie X, Pierre JS, Sihag S, Yang W, Altshuler D, Puigserver P, Patterson N, Willy PJ, Schulman IG, Heyman RA, Lander ES, Spiegelman BM. Errα and Gabpa/b specify PGC-1α-dependent oxidative phosphorylation gene expression that is altered in diabetic muscle. Proceedings of the National Academy of Sciences. 2004;101:6570–6575. doi: 10.1073/pnas.0401401101.
- Rajagopalan DA, Agarwal P. Inferring pathways from gene lists using a literature-derived network of biological relationships. Bioinformatics. 2005;21:788–793. doi: 10.1093/bioinformatics/bti069.
- Saxena V. Genomic Response, Bioinformatics, and Mechanics of the Effects of Forces on Tissues and Wound Healing. PhD Dissertation, Department of Mechanical Engineering, Massachusetts Institute of Technology; 2001.
- Schwarz GE. Estimating the dimension of a model. The Annals of Statistics. 1978;6:461–464.
- Self SG, Liang KY. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. Journal of the American Statistical Association. 1987;82:605–610.
- Storey JD. A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B. 2002;64:479–498.
- Storey JD. The positive false discovery rate: a Bayesian interpretation and the q-value. Annals of Statistics. 2003;31:2013–2035.
- Subramanian A, Tamayo P, Mootha V, Mukherjee S, Ebert B, Gillette M, Paulovich A, Pomeroy S, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences. 2005;102(43):15545–15550. doi: 10.1073/pnas.0506580102.
- Tibshirani R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
- Vu HTV, Zhou S. Generalization of likelihood ratio tests under nonstandard conditions. The Annals of Statistics. 1997;25:916–987.
- Wang Y. Mixed-effects smoothing spline ANOVA. Journal of the Royal Statistical Society, Series B. 1998;60:159–174.
- Zhang D, Lin X, Raz J, Sowers M. Semi-parametric mixed models for longitudinal data. Journal of the American Statistical Association. 1998;93:710–719.
- Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B. 2005;67:301–320.