Abstract
Most statistical methods for microarray data analysis consider one gene at a time, and they may miss subtle changes at the single-gene level. This limitation may be overcome by considering a set of genes simultaneously, where the gene sets are derived from prior biological knowledge. We define a pathway as a predefined set of genes that serve a particular cellular or physiological function. Limited work has been done in the regression setting to study the effects of clinical covariates and of the expression levels of genes in a pathway on a continuous clinical outcome. A semiparametric regression approach for identifying pathways related to a continuous outcome was proposed by Liu et al. (2007), who demonstrated the connection between a least squares kernel machine for the nonparametric pathway effect and restricted maximum likelihood (REML) for variance components. However, the asymptotic properties of this semiparametric regression for identifying pathways have never been studied. In this paper, we study the asymptotic properties of the parameter estimates in the semiparametric regression and compare Liu et al.'s REML with our REML obtained from a profile likelihood. We prove that both approaches provide consistent estimators, attain the n−1/2 convergence rate under regularity conditions, and have either an asymptotically normal distribution or a mixture of normal distributions. However, the estimators based on our REML obtained from a profile likelihood have a theoretically smaller mean squared error than those of Liu et al.'s REML. A simulation study supports this theoretical result. A profile restricted likelihood ratio test is also provided for the non-standard testing problem. We apply our approach to a type II diabetes data set (Mootha et al., 2003).
Keywords: Gaussian random process, Kernel machine, Mixed model, Pathway analysis, Profile likelihood, Restricted maximum likelihood
1. Introduction
Numerous statistical methods have been developed for analyzing microarray data based on single genes (Efron et al., 2004; Fan and Li, 2001; Tibshirani, 1996; Zou and Hastie, 2005). However, they may not detect coordinated yet subtle changes among a set of genes where each gene only shows modest changes. One way to address this limitation of single gene based analysis is to analyze gene sets derived from prior biological knowledge to uncover patterns among the genes within a set. A number of methods and programs have been developed to consider gene groupings based on gene ontology (GO) (Harris et al., 2004). These methods have been successful in detecting subtle changes in expression levels which could be missed by single gene based analysis (Mootha et al., 2003; Hosack et al., 2003; Rajagopalan and Agarwal, 2005). In the following discussion, we consider a pathway as a predefined set of genes that serve a particular cellular or physiological function.
Limited work has been done in the regression setting to study the effects of clinical covariates and expression levels of genes in a pathway on a continuous clinical outcome. Goeman et al. (2004) proposed a global test derived from a random effects model. Liu et al. (2007) proposed a semiparametric regression model for high dimensional covariate data. Liu et al.'s approach is developed by connecting a least squares kernel machine in a nonparametric model with REML in a linear mixed model. However, the statistical properties of their approach have not been studied. In the rest of this paper, we refer to Liu et al.'s approach as LLG's approach.
The goal of our study is to establish the asymptotic properties of these estimators: consistency, convergence rate, and limiting distributions under regularity conditions. We compare LLG's estimators with ours, which are estimated using a REML obtained from a profile likelihood. We note that although our likelihood is a penalized likelihood, we simply use it as a profile likelihood. Note that there is a penalized likelihood called penalized Henderson's likelihood (Wang, 1998; Gu and Ma, 2005), which is obtained using smoothing splines.
We show that our REML estimators give more accurate score equations and information matrix, and have smaller mean squared errors (MSEs) than those of LLG’s REML. For non-standard testing problems, a profile restricted likelihood ratio test is also provided.
This paper is organized as follows. In Section 2, we give the definition of the least squares kernel machine estimator in the semiparametric model. We then explain LLG's likelihood-based approach to semiparametric regression and compare their method with our REML obtained from a profile likelihood. Section 3 describes the asymptotic properties of both estimators. In Section 4, we provide a testing procedure based on a profile restricted likelihood ratio test for the nonstandard testing problem and derive its theoretical distribution. In Section 5, we report simulation results comparing the MSEs, type I errors, and power of LLG's estimators with those of ours. In Section 6, we apply our approach to the type II diabetes data set analyzed by Mootha et al. (2003). Section 7 contains concluding remarks.
2. Semiparametric regression
A semiparametric regression model can be written as
Y = Xβ + r(Z) + ε, | (1) |
where Y is an n × 1 vector denoting the continuous outcome measured on n subjects, X is an n × q matrix representing q clinical covariates of these subjects, β is a q × 1 vector of regression coefficients for the covariate effects, Z is a p × n matrix denoting the gene expression matrix for p genes (p ≫ n), Z = [z1, …, zn], and zi is a p × 1 vector for the gene expression levels of the ith subject, r(·) is an unknown nonlinear smooth function, and ε ~ N(0, σ2I), where σ2 >0. Because of the high dimensional space of Z, we make statistical inference for the model (1) by connecting a least squares kernel machine with a Gaussian random process, where r(Z) = {r(z1), …, r(zn)} ~ N(0, τK) with r(·) following a Gaussian process (GP) with mean 0 and covariance cov{r(z), r(z′)} = τK(z, z′), τ>0, z and z′ represent two different arbitrary p × 1 vectors, K(·, ·) is a kernel function which implicitly specifies a unique function space spanned by a particular set of orthogonal basis functions, and K is an n × n matrix with the ijth component K(zi, zj), i, j = 1, …, n.
The linear effects of the clinical variables are adjusted in this model. This model reduces to the standard linear regression model when τ = 0. If τ>0, two samples i and j with similar gene expression patterns have correlated random effects r(zi) and r(zj), and therefore they have a greater probability of having similar outcomes yi and yj than samples with less similar expression patterns. This “similarity” is measured using a kernel function.
We consider the following three kernels K(·, ·) to model the covariance matrix of pathway effects:
The dth order polynomial kernel, K(z, z′) = (zTz′)d, d = 1, 2, which quantifies similarity through the inner product.
Gaussian kernel, K(z, z′) = exp(−||z−z′||2/ρ), where ||·|| denotes the Euclidean norm and ρ>0 is an unknown scale parameter. The "similarity" in this kernel is measured through the Euclidean distance between z and z′.
Neural network kernel, K(z, z′) = tanh(zTz′), which uses the hyperbolic tangent (tanh) function to quantify similarity.
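In matrix form, each choice yields the n × n kernel matrix K with entries K(zi, zj). A minimal NumPy sketch of the three kernels (function names are ours), with Z stored as the p × n expression matrix as in model (1):

```python
import numpy as np

def polynomial_kernel(Z, d=2):
    """K(z, z') = (z^T z')^d for the columns (subjects) of the p x n matrix Z."""
    G = Z.T @ Z          # n x n Gram matrix of inner products
    return G ** d

def gaussian_kernel(Z, rho=1.0):
    """K(z, z') = exp(-||z - z'||^2 / rho) with rho > 0 a scale parameter."""
    sq = np.sum(Z ** 2, axis=0)
    # squared Euclidean distances between all pairs of columns
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Z.T @ Z)
    return np.exp(-np.maximum(d2, 0.0) / rho)   # clip tiny negatives from rounding

def neural_network_kernel(Z):
    """K(z, z') = tanh(z^T z')."""
    return np.tanh(Z.T @ Z)
```

Each function returns a symmetric n × n matrix; for the Gaussian kernel the diagonal is identically one.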
By connecting a least squares kernel machine (Cristianini and Shawe-Taylor, 2006) with a restricted maximum likelihood (REML) (Zhang et al., 1998; Wang, 1998), Liu et al. (2007) estimated the nonparametric pathway effects of multiple gene expressions, r(Z), using the least squares kernel machine, estimated the covariate effects, β, using the weighted least squares estimator, and estimated the parameters of the variance components using REML. They derived the score equations and information matrix of the variance component parameters by treating the estimated β̂ as if it were the true parameter without uncertainty. However, because β̂ depends on the variance component parameters, we consider incorporating the relationship between β̂ and the variance component parameters into the REML to obtain more accurate score equations and information matrix for these parameters.
We adopt the same general approach to estimate the nonparametric function r(·). β and r(·) are estimated by maximizing the scaled penalized likelihood function with smoothing parameter λ, or equivalently by minimizing

Σi=1n {yi − xiTβ − r(zi)}2 + λ||r||2HK,

where ||·||HK represents the norm induced by the kernel K on HK, the function space generated by the kernel function K. This estimator is called the least squares kernel machine estimator (Liu et al., 2007).
By using the dual formulation (Cristianini and Shawe-Taylor, 2006; Liu et al., 2007) to reduce the high dimensional problem to a low dimensional one, the parameters β and r(·) can be estimated as

β̂ = {XT(I + λ−1K)−1X}−1XT(I + λ−1K)−1y and r̂ = λ−1K(I + λ−1K)−1(y − Xβ̂).
For estimating the variance component parameters, we start from a mixed model formulation with the assumption that the nonparametric function r(·) follows a Gaussian process (GP) with mean 0 and covariance cov{r(z), r(z′)} = τK(z, z′) = τ exp(−||z−z′||2/ρ), where ρ>0 is an unknown scale parameter. Using the marginal distribution of y ~ N(Xβ, Σ), the regression coefficient estimator β̂ can be obtained from the weighted least squares estimator β̂ = (XTΣ−1X)−1XTΣ−1Y, where Σ = σ2I+τK = Σ(Θ), Θ = (τ, ρ, σ2), and Σ(Θ) is an n × n positive definite variance matrix. The covariance of β̂ and of the nonparametric function estimator r̂(·) can be estimated as cov(β̂) = (XTΣ−1X)−1, cov(r̂) = τK−(τK)P(τK), and cov{r̂(z)} = τK(z, z)− (τKz)TP(τKz), where P = Σ−1−Σ−1X(XTΣ−1X)−1XTΣ−1 and Kz = {K(z, z1), …, K(z, zn)}T. Based on the above estimates, the variance component parameters, Θ = (τ, ρ, σ2), can be estimated through the REML, a commonly used approach for estimating variance components in the mixed effects model, and the smoothing parameter can be obtained from λ = τ−1σ2.
LLG's REML and our REML are
ℓ1(Θ) = −(1/2) log|Σ(Θ)| − (1/2) log|XTΣ(Θ)−1X| − (1/2)(y − Xβ̂)TΣ(Θ)−1(y − Xβ̂), | (2) |
ℓ2(Θ) = −(1/2) log|Σ(Θ)| − (1/2) log|XTΣ(Θ)−1X| − (1/2)yTP(Θ)y, | (3) |
where β̂ in (2) is treated as fixed and P(Θ) is defined above.
LLG calculate the score equations and information matrix of Θ = (τ, ρ, σ2) at a given β̂. Unlike LLG's approach, we consider a profile likelihood and replace β̂ by (XTΣ−1X)−1XTΣ−1y because β̂ depends on the parameters Θ. We then obtain the score equations and information matrix of Θ (see Appendix A.1). All parameters are then estimated using the Newton–Raphson method.
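The profile REML in (3) can be evaluated directly once the kernel matrix is fixed. The sketch below (function names are ours) re-derives the weighted least squares β̂ at every candidate Θ, as the profile likelihood requires, and uses a crude grid search in place of the Newton–Raphson update; ρ is held fixed inside the precomputed K for simplicity:

```python
import numpy as np

def profile_reml_negloglik(tau, sigma2, y, X, K):
    """Negative of l2(Theta) in (3): beta-hat = (X'S^-1 X)^-1 X'S^-1 y
    is profiled out, so the criterion depends only on (tau, sigma2)
    for a fixed kernel matrix K."""
    n = len(y)
    Sigma = sigma2 * np.eye(n) + tau * K
    Si = np.linalg.inv(Sigma)
    XtSiX = X.T @ Si @ X
    beta = np.linalg.solve(XtSiX, X.T @ Si @ y)   # GLS estimate, re-derived at each Theta
    resid = y - X @ beta
    ld_S = np.linalg.slogdet(Sigma)[1]
    ld_X = np.linalg.slogdet(XtSiX)[1]
    return 0.5 * (ld_S + ld_X + resid @ Si @ resid)

def fit_variance_components(y, X, K, taus, sigma2s):
    """Crude grid search standing in for the Newton-Raphson iteration."""
    grid = [(profile_reml_negloglik(t, s, y, X, K), t, s)
            for t in taus for s in sigma2s]
    _, tau_hat, s2_hat = min(grid)
    return tau_hat, s2_hat
```

In practice the Newton–Raphson method with the score equations of Appendix A.1 replaces the grid, but the objective being maximized is the same.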
3. Asymptotic properties
In this section, we provide the asymptotic properties of the estimators: consistency, convergence rate, and limiting distribution. We also show that the estimators from our REML have asymptotically smaller MSEs than those of LLG's REML.
We first give a consistency result for our estimators when the true parameter Θ0 is either in the interior or on the boundary of the parameter space Ω. For this result we need the parameter space Ω near Θ0 to behave like a closed set (Self and Liang, 1987; Vu and Zhou, 1997), i.e., the intersection of Ω and the closure of neighborhoods centered at Θ0 must constitute closed subsets. This requirement is satisfied because our Ω is a rectangle, a cross product of intervals.
Theorem 1
If Θ̂1 and Θ̂2 are estimators of Θ using REML functions (2) and (3), respectively, then Θ̂1 and Θ̂2 are both consistent estimators under regularity conditions.
Theorem 1 proves that the estimators from both approaches are consistent under regularity conditions. These regularity conditions are described in Appendix A.2. The proof of Theorem 1 is summarized in Appendix A.3.
Next, we show that both approaches attain the n−1/2 convergence rate. Before proving this rate, we first present the following two lemmas (Lemmas 1 and 2).
Lemma 1
If Σ and Σk are simultaneously diagonalizable, where Σk is the first derivative of Σ with respect to the kth component of Θ, then (1/n) tr(PΣk) uniformly converges. That is, (1/n) tr(PΣτ) and (1/n) tr(PΣσ2) uniformly converge.
Its proof is summarized in Appendix A.4. Lemma 1 implies that (1/n) tr(PΣiPΣj) uniformly converges because 0 ≤ tr(PΣiPΣj) ≤ {tr(PΣiPΣi)}1/2{tr(PΣjPΣj)}1/2 by the Cauchy–Schwarz inequality.
Lemma 2
If ΣΣk = ΣkΣ, then Σ and Σk are simultaneously diagonalizable.
It is well known that if AB = BA for two diagonalizable matrices A and B, then A and B are simultaneously diagonalizable; here Σ and Σk are symmetric and hence diagonalizable. We use this fact in the proof.
The convergence rate can be proved using (1) these two lemmas, (2) four conditions C1–C4 (see Appendix A.5), which are described in Cressie and Lahiri (1993, 1996), and (3) the condition that the intersection of Ω and the closure of neighborhoods centered about Θ0 constitute closed subsets (Vu and Zhou, 1997).
Theorem 2
Under conditions C1–C4 in Appendix A.5, Θ̂1 − Θ0 = Op(n−1/2) and Θ̂2 − Θ0 = Op(n−1/2).
In contrast to Cressie and Lahiri (1993, 1996), who showed the asymptotic properties of the REML estimator in a linear model with Gaussian errors, we consider a semiparametric model. The proof of Theorem 2 is described in Appendix A.6. Because the intersection of Ω and the closure of neighborhoods centered at Θ0 constitutes closed subsets, this result also holds when the true parameter is on the boundary of the parameter space Ω (Vu and Zhou, 1997). In our study we use the Gaussian, polynomial, and neural network kernels. We show that Σ and Στ are simultaneously diagonalizable, as are Σ and Σσ2. For the Gaussian kernel, using Lemma 2, we show that Σ and Σρ are simultaneously diagonalizable as n goes to ∞.
Theorem 3
(3.1) Suppose that the true parameters are the interior points of the parameter space. Then under regularity conditions, Θ̂1 and Θ̂2 are asymptotically normally distributed.
(3.2) Suppose that one component of the true parameters is a left endpoint of the parameter space. Then under regularity conditions, Θ̂1 and Θ̂2 are asymptotically distributed with a mixture of normal distributions.
The proof of (3.1) is explained in Appendix A.7. The proof of (3.2) can be shown by Chant’s (1974) case (i).
As for the MSEs, we compare our estimator based on our REML (3) with LLG's estimator based on LLG's REML (2). We prove in the following theorem that the estimators from our REML have asymptotically smaller MSEs than those of LLG's REML.
Theorem 4
E||Θ̂2−Θ||2 < E||Θ̂1−Θ||2 asymptotically.
Its proof is given in Appendix A.8. This result implies that our score equations and information matrix are more accurate than those of Liu et al. (2007) because we take into account the relationship between β̂ and the variance components in the REML.
Although estimators of parameters in the semiparametric regression approach developed by connecting a least squares kernel machine with REML have useful large sample properties, there are some situations in which ρ and τ are not identifiable. This result is summarized in Theorem 5.
Theorem 5
ρ and τ are not identifiable under one of the following situations: (i) τ→0; (ii) ρ→0 and τ ~ O(1/ρm) for any positive m; or (iii) 1/ρ→0.
This is because the marginal distribution of Y − Xβ reduces to N(0, σ2I) under conditions (i) and (ii) and to N(0, σ2I+τJ) under condition (iii), where J is the n × n matrix of ones, so the likelihood carries no separate information about τ and ρ. Its proof is given in Appendix A.9.
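These degeneracies are easy to see numerically for the Gaussian kernel: as ρ→0 the kernel matrix tends to the identity (so τK is absorbed into σ2I), and as 1/ρ→0 it tends to the matrix of ones J. A quick check with illustrative values (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(10, 4))                   # p = 10 genes, n = 4 subjects
sq = np.sum(Z**2, axis=0)
D2 = sq[:, None] + sq[None, :] - 2 * Z.T @ Z   # squared distances between subjects
np.fill_diagonal(D2, 0.0)                      # exact zeros on the diagonal

def gaussian_K(rho):
    return np.exp(-np.maximum(D2, 0.0) / rho)

# (ii) rho -> 0: off-diagonals vanish, K -> I, so tau*K merges with sigma^2*I
K_small = gaussian_K(1e-8)
# (iii) 1/rho -> 0: all entries tend to 1, K -> J, so tau*K -> tau*J
K_large = gaussian_K(1e8)
```

In either limit the covariance Σ = σ2I + τK depends on (τ, ρ) only through a degenerate form, matching Theorem 5.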
4. Test for the nonparametric function
Our main interest is to test H0: r(Z) = constant, where r(·) is an unknown nonlinear smooth function under model (1). Under the Gaussian random process with the Gaussian kernel, the null hypothesis H0: {r(Z) is a point mass at zero} ∪ {r(Z) has a constant covariance matrix as a function of Z} is equivalent to H0: τ/ρ = 0 or ρ = 0. For the polynomial and neural network kernels, the null hypothesis is H0: τ = 0. The proof of this equivalence is given in Appendix A.10. We perform this test using two approaches: a profile restricted likelihood ratio test, described in Section 4.1, and a permutation test, described in Section 4.2.
4.1. Profile restricted likelihood ratio test
We derive a profile restricted likelihood ratio test (PRLRT) by taking into account that the parameter of interest under the null hypothesis lies on the boundary of the parameter space and that the kernel matrix K is not a block diagonal matrix. The theoretical properties of our profile REML estimators, shown in Section 3, are needed for the PRLRT. We can perform the PRLRT using both the empirical distribution obtained by permutation and the theoretical distribution derived in the following subsections for each null hypothesis.
4.1.1. Test for H0: τ/ρ = 0 vs Ha: τ/ρ>0
The null hypothesis H0 means that either H01: τ = 0 with ρ positive and bounded, or H02: ρ = ∞ with τ positive and bounded. The PRLRT can be derived as follows.
Test for H01: τ = 0 vs τ >0, where ρ is positive and bounded. Let Ω be the parameter space for Θ = (τ, ρ, σ2)T. Denote by Ω0 and Ω1 = Ω\Ω0 the parameter spaces under H0 and Ha, respectively. The true parameters Θ0 are either in the interior or on the boundary of the parameter space Ω. Assume that the parameter spaces Ω0 and Ω1 can be approximated at Θ0 by cones CΩ0 and CΩ1, respectively, with vertex Θ0.
The PRLRT statistic, D, is the deviance, i.e., two times the difference of the log profile restricted likelihoods, D = 2l2(Θ̂)−2l2(Θ̂0), where Θ̂ and Θ̂0 maximize (3) over Ω and Ω0, respectively. By Claeskens (2004), who extended the non-standard LRT to the PRLRT setting, D converges in distribution to

infΘ̃∈C̃0 ||U−Θ̃||2 − infΘ̃∈C̃ ||U−Θ̃||2,
where C̃ = {Θ̃: Θ̃ = I(Θ0)T/2(Θ−Θ0), Θ ∈ CΩ} is the orthonormal transformation of the cone approximation, CΩ, of the parameter space Ω with Θ0 as the vertex, and C̃0 = {Θ̃: Θ̃ = I(Θ0)T/2(Θ−Θ0), Θ ∈ CΩ0} is the orthonormal transformed cone approximation of the parameter space Ω0 under the null hypothesis. U is a random vector from N(0, I), and I(Θ0)T/2 is the right Cholesky square root of the profile REML information matrix, i.e., I(Θ0) = [I(Θ0)]1/2[I(Θ0)]T/2.
Under the null hypothesis, Θ0 = (0, ρ, σ2)T and ρ is inestimable. Let Θ = (τ, σ2)T. The cone parameter spaces then reduce to CΩ = [0, ∞) × (0, ∞), CΩ0 = {0} × (0, ∞), and CΩ1 = (0, ∞)2.
Decompose the normal vector U and I(Θ0) as U = (U1, U2)T and I(Θ0) = {Ijk} corresponding to τ and σ2. After some algebra, we can show that

infΘ̃∈C̃0 ||U−Θ̃||2 = U12 and infΘ̃∈C̃ ||U−Θ̃||2 = U12·1(U1<0),

where 1(·) denotes the indicator function. Therefore, the difference between these two becomes D = U12·1(U1≥0). Since U1 ~ N(0, 1), the asymptotic distribution of D under H01 is a 50:50 mixture of a point mass at zero and a chi-square distribution with 1 degree of freedom.
Test for H02: ρ = ∞, where τ is positive and bounded. The hypothesis H02: ρ = ∞ means that 1/ρ = 0. Let ρ* = 1/ρ. In this case, Θ = (θ1, θ2), where θ1 = ρ* and θ2 = (τ, σ2). CΩ0 = {0} × (0, ∞)2 and CΩ1 = (0, ∞)3. Similar to the previous case, decompose the normal vector U and I(Θ0) as U = (U1, U2)T and I(Θ0) = {Ijk} corresponding to θ1 and θ2. We can then show that
infΘ̃∈C̃0 ||U−Θ̃||2 = U12, | (4) |
infΘ̃∈C̃ ||U−Θ̃||2 = U12·1(U1<0). | (5) |
In a similar way, the asymptotic distribution of D under H02: ρ = ∞ is a 50:50 mixture of a point mass at zero and a chi-square distribution with 1 degree of freedom.
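For a 50:50 mixture of a point mass at zero and χ21, the p-value for an observed deviance d>0 is half the χ21 tail probability, so the 5% critical value is about 2.71 rather than the naive χ21 value 3.84. A small sketch (function name is ours), using 0.5·P(χ21 ≥ d) = 1 − Φ(√d):

```python
import math

def prlrt_pvalue(D):
    """P-value under the 50:50 mixture (1/2)*chi2_0 + (1/2)*chi2_1.
    For d > 0, P(D_null >= d) = 0.5 * P(chi2_1 >= d) = 1 - Phi(sqrt(d)),
    computed via the error function; D <= 0 gives p-value 1."""
    if D <= 0:
        return 1.0
    # standard normal CDF: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return 1.0 - 0.5 * (1.0 + math.erf(math.sqrt(D) / math.sqrt(2.0)))
```

Halving the tail probability matters in practice: using the plain χ21 reference would make the test conservative and cost power.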
4.1.2. Test for H0: ρ = 0
The null hypothesis H0 means that either H03: ρ = 0 with τ positive and bounded, or H04: ρ = 0 and τ = 0.
Test for H03: ρ = 0, where τ is positive and bounded. In this case, Θ = (θ1, θ2), where θ1 = ρ and θ2 = (τ, σ2). CΩ0 = {0} × (0, ∞)2 and CΩ1 = (0, ∞)3. Similar to the previous case, decompose U and I(Θ0) as U = (U1, U2)T and I(Θ0) = {Ijk} corresponding to θ1 and θ2. In a similar way, the asymptotic distribution of D under H03 is a 50:50 mixture of a point mass at zero and a chi-square distribution with 1 degree of freedom.
Test for H04: ρ = 0 and τ = 0, where ρ < o(τ). In this case, Θ = (θ1, θ2, θ3), where θ1 = ρ, θ2 = τ, and θ3 = σ2. CΩ0 = {0}2 × (0, ∞) and CΩ1 = (0, ∞)3. Decompose the normal vector U and I(Θ0) as U = (U1, U2, U3)T and I(Θ0) = {Ijk} corresponding to θ1, θ2, and θ3. Under the orthonormal transformation, the cone spaces become C̃ = {Θ: ηθ2−θ1 ≥ 0, θ2 ≥ 0} and C̃0 = {Θ: θ2 = θ1 = 0}, where η = Ĩ12|Ĩ(Θ0)|−1/2 is the slope of the θ2 axis after transformation and Ĩ(Θ0) = {Ĩjk}, j, k = 1, 2, is the submatrix of the information matrix corresponding to (θ1, θ2). We can then show that

infΘ̃∈C̃0 ||U−Θ̃||2 = U12 + U22.
The representations of infΘ̃∈C̃ ||U−Θ̃||2 differ across the four regions of the plane with coordinates (θ1, θ2)T.
The area proportions of these four regions, (π, 1/4, 1/4, 1/2−π) in the aforementioned order, determine the probabilities that the vector U lies in each region, where π = cos−1{η · (1+η2)−1/2} = cos−1{Ĩ12 · (Ĩ11Ĩ22)−1/2}. Then the asymptotic distribution of D is the difference of the above two representations.
Because U1 and U2 are independent, the final approximate asymptotic distribution of D is a mixture of χ20, χ21, and χ22 distributions with the mixing proportions determined by the four regions above.
4.2. Permutation test
For identifying significant pathways, we also consider using the following permutation procedures.
Step 1: estimate τ/ρ, ρ, τ using the observed data by fitting the semiparametric model and calculate the residuals ε̂0i from yi = xiβ + ε0i.
Step 2: permute the residuals and simulate outcomes as yi* = xiβ̂ + ε̂0i*, where {ε̂0i*} denotes the permuted residuals.
Step 3: based on y*, x, and z, fit the semiparametric model using the likelihood-based approach and then estimate τ̂*/ρ̂*, ρ̂*, and τ̂*.
Step 4: repeat Steps 2 and 3 a large number of times, e.g., 10,000 times.
Step 5: estimate the statistical significance by the percentage of times either τ̂*/ρ̂*>τ̂/ρ̂ or ρ̂*>ρ̂, where τ̂/ρ̂, ρ̂, τ̂ are the estimated values from the observed data.
Significant pathways can be selected based on this percentage, which serves as the statistical significance level.
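Steps 1–5 can be sketched as follows (names are ours). The function `stat_fn` stands in for the semiparametric fit returning τ̂/ρ̂ (or τ̂); in the usage below we substitute a simple correlation-based statistic purely to keep the sketch self-contained:

```python
import numpy as np

def permutation_pvalue(y, X, Z, stat_fn, n_perm=1000, seed=0):
    """Steps 1-5: fit the null model y = X beta + eps by least squares,
    permute the residuals, recompute the statistic on each permuted
    outcome, and report the fraction of permuted statistics that exceed
    the observed one."""
    rng = np.random.default_rng(seed)
    beta0, *_ = np.linalg.lstsq(X, y, rcond=None)     # Step 1: null fit
    resid = y - X @ beta0                             # Step 1: residuals
    t_obs = stat_fn(y, X, Z)                          # observed statistic
    count = 0
    for _ in range(n_perm):                           # Step 4: repeat
        y_star = X @ beta0 + rng.permutation(resid)   # Step 2: permuted outcome
        if stat_fn(y_star, X, Z) > t_obs:             # Step 3: refit statistic
            count += 1
    return count / n_perm                             # Step 5: significance
```

With the real estimator plugged in as `stat_fn`, the returned fraction is the permutation significance of Step 5.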
To rank the importance of genes within a significant pathway related to clinical outcomes, we perform the following steps:
Step 6: bootstrap the samples B times. For each bootstrapped sample, b = 1, …, B, estimate the parameters by fitting the following model:
Step 7: calculate the absolute value of difference, , and , under the Gaussian kernel. For other kernels, only calculate the absolute value of difference .
Step 8: rank the importance of the genes by the mean absolute difference. If a gene plays an important role in a pathway, this difference will be large.
We can also rank the importance of gene pairs by performing the same procedure except for fitting the following model
5. Simulation studies
5.1. Mean squared error and coverage probability
We conducted simulations to compare the MSEs and coverage probabilities of LLG’s estimators with those of our estimators.
We considered the following cases and simulated 1000 data sets with two sets of n and p values. One set is (n, p) = (60, 5) with a relatively large sample size compared to the number of genes and the other set is (n, p) = (60, 200) with a relatively small sample size compared to the number of genes.
Case 1: yi = βxi + r(zi1, zi2, …, zip)+ εi with β = 1, xi = 3 cos(zi1)+2ui with ui independent of zi1 and following N(0, 1), zij ~ Uniform(0, 1), and the true r(z) ~ GP{0, τKg(ρ)}, where Kg(ρ)(z, z′) = exp(−||z−z′||2/ρ) and ||·|| denotes the Euclidean norm.
Case 2: the same setting as in Case 1 except that r(z) ~ GP(0, τKp2), where Kp2(z, z′) = (zTz′)2.
Case 3: the same setting as in Case 1 except that r(z) ~ GP(0, τKp1), where Kp1(z, z′) = zTz′.
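Case 1 can be simulated by drawing r(Z) from N(0, τK) via a Cholesky factor of the Gaussian kernel matrix. A sketch (names are ours; a small jitter is added to the kernel matrix for numerical stability of the factorization):

```python
import numpy as np

def simulate_case1(n=60, p=5, beta=1.0, tau=1.0, rho=1.0, sigma=1.0, seed=0):
    """Case 1: y_i = beta*x_i + r(z_i) + eps_i with r ~ GP{0, tau*K_g(rho)}."""
    rng = np.random.default_rng(seed)
    Z = rng.uniform(0.0, 1.0, size=(p, n))        # z_ij ~ Uniform(0, 1)
    u = rng.normal(size=n)
    x = 3.0 * np.cos(Z[0]) + 2.0 * u              # x_i = 3*cos(z_i1) + 2*u_i
    sq = np.sum(Z**2, axis=0)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z.T @ Z, 0.0)
    K = np.exp(-D2 / rho)                         # Gaussian kernel matrix
    # draw r(Z) ~ N(0, tau*K); jitter keeps the Cholesky factor stable
    L = np.linalg.cholesky(tau * K + 1e-6 * np.eye(n))
    r = L @ rng.normal(size=n)
    y = beta * x + r + rng.normal(scale=sigma, size=n)
    return y, x, Z

y, x, Z = simulate_case1()
```

Cases 2 and 3 follow by swapping in the quadratic or linear kernel for K before the Cholesky draw.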
For each case, we estimated the parameters in the semiparametric regression model using LLG's REML and our REML. We calculated the average MSEs and coverage probabilities of the parameter estimates. The results are summarized in Table 1. For both the n>p and n<p cases, our REML estimators have smaller MSEs than LLG's REML estimators. Both approaches have comparable coverage probabilities.
Table 1.
The coverage probability (cvpr) of 95% confidence intervals and average mean squared error values (mse) of parameter estimates. LLIKE=LLG’s likelihood-based approach with kernel K; PLIKE=our profile likelihood-based approach with kernel K; GK=Gaussian kernel; P2K=Quadratic kernel; P1K=Linear kernel.
| n, p | K | Method | Case | Measure | β | σ | τ | ρ |
|---|---|---|---|---|---|---|---|---|
| n=60, p=5 | GK | LLIKE | 1 | cvpr × 100 | 97 | 96 | 91 | 90 |
| | | | | mse | 0.0053 | 0.0247 | 0.0483 | 0.0623 |
| | | PLIKE | 1 | cvpr × 100 | 97 | 96 | 92 | 91 |
| | | | | mse | 0.0048 | 0.0151 | 0.0322 | 0.0458 |
| | P2K | LLIKE | 2 | cvpr × 100 | 98 | 93 | 95 | N/A |
| | | | | mse | 0.0053 | 0.0485 | 0.0252 | N/A |
| | | PLIKE | 2 | cvpr × 100 | 98 | 93 | 95 | N/A |
| | | | | mse | 0.0050 | 0.0474 | 0.0243 | N/A |
| | P1K | LLIKE | 3 | cvpr × 100 | 97 | 93 | 95 | N/A |
| | | | | mse | 0.0051 | 0.0104 | 0.0165 | N/A |
| | | PLIKE | 3 | cvpr × 100 | 97 | 93 | 95 | N/A |
| | | | | mse | 0.0042 | 0.0095 | 0.0142 | N/A |
| n=60, p=200 | GK | LLIKE | 1 | cvpr × 100 | 89 | 87 | 87 | 80 |
| | | | | mse | 0.0321 | 0.0436 | 0.0424 | 0.1261 |
| | | PLIKE | 1 | cvpr × 100 | 89 | 89 | 87 | 80 |
| | | | | mse | 0.0304 | 0.0295 | 0.0421 | 0.1120 |
| | P2K | LLIKE | 2 | cvpr × 100 | 86 | 63 | 87 | N/A |
| | | | | mse | 0.0478 | 0.4683 | 0.0388 | N/A |
| | | PLIKE | 2 | cvpr × 100 | 86 | 63 | 87 | N/A |
| | | | | mse | 0.0475 | 0.4620 | 0.0379 | N/A |
| | P1K | LLIKE | 3 | cvpr × 100 | 86 | 65 | 82 | N/A |
| | | | | mse | 0.0486 | 0.2984 | 0.0583 | N/A |
| | | PLIKE | 3 | cvpr × 100 | 86 | 66 | 85 | N/A |
| | | | | mse | 0.0480 | 0.2839 | 0.0564 | N/A |
5.2. Consistency
We also performed an additional simulation to study the accuracy of our estimates as n increases with p fixed. We considered n = 15, 30, 60, and 600: one small, two medium, and one relatively large sample size. For each case, we simulated 1000 data sets; the average MSEs are summarized in Table 2.
Table 2.
The average mean squared error values (mse) of parameter estimates for different sample sizes. LLIKE=LLG’s likelihood-based approach with kernel K; PLIKE=our profile likelihood-based approach with kernel K; GK=Gaussian kernel.
| n, p | K | Method | Case | Measure | β | σ | τ | ρ |
|---|---|---|---|---|---|---|---|---|
| n=15, p=5 | GK | LLIKE | 1 | mse | 0.0865 | 0.0774 | 0.0799 | 0.0974 |
| | | PLIKE | 1 | mse | 0.0741 | 0.0643 | 0.0673 | 0.0791 |
| n=30, p=5 | GK | LLIKE | 1 | mse | 0.0067 | 0.0319 | 0.0529 | 0.0723 |
| | | PLIKE | 1 | mse | 0.0062 | 0.0234 | 0.0436 | 0.0558 |
| n=60, p=5 | GK | LLIKE | 1 | mse | 0.0053 | 0.0247 | 0.0483 | 0.0623 |
| | | PLIKE | 1 | mse | 0.0048 | 0.0151 | 0.0322 | 0.0458 |
| n=600, p=5 | GK | LLIKE | 1 | mse | 0.0025 | 0.0114 | 0.0235 | 0.0312 |
| | | PLIKE | 1 | mse | 0.0020 | 0.0107 | 0.0197 | 0.0259 |
5.3. Type I error and power
For the assessment of type I error and power for both approaches, we considered the following Case 4 for type I error and Case 5 for power, respectively. We consider Case 5 because the nonparametric function r(zi1, zi2, …, zip) is allowed to have a complex form, with nonlinear functions of the z's and interactions among the z's, and because it allows xi and (zi1, …, zip) to be correlated. We obtained type I error and power using the profile restricted likelihood ratio test (PRLRT) described in Section 4.1. We performed the PRLRT using both the empirical distribution and the theoretical distribution, denoted "PRLRT(e)" and "PRLRT(t)", respectively. We also compared them with the permutation test described in Section 4.2, denoted "PERM".
Case 4: yi = βxi +εi with β = 1, xi = 3 cos(π/6)+ 2ui with ui ~ N(0, 1) and εi ~ N(0, σ2).
Case 5: yi = βxi +r(zi1, zi2, …, zip)+εi, with xi = 3 cos(zi1)+ 2ui, ui independent of zi1 and following N(0, 1), zij ~ Uniform(0, 1), and r(·) a complex nonlinear function of the z's with interactions among them.
The estimated type I error rates for both approaches are summarized in Tables 3–5; they are all close to the nominal level when n>p. However, when n<p, both methods deviate from the nominal level. The power of the two approaches is comparable. The performance of "PRLRT(e)" and "PRLRT(t)" was similar, as was that of "PRLRT" and "PERM".
Table 3.
Estimated type I error rate and power based on profile restricted likelihood ratio test (PRLRT) and permutation test (PERM). LLIKE=LLG’s likelihood-based approach with kernel K; PLIKE=our profile likelihood-based approach with kernel K; GK=Gaussian kernel; PRLRT(e) and PRLRT(t) are profile restricted likelihood ratio tests which are performed using empirical distribution and theoretical distribution, respectively.
| n, p | Measure | Case | LLIKE: PERM | LLIKE: PRLRT(e) | LLIKE: PRLRT(t) | PLIKE: PERM | PLIKE: PRLRT(e) | PLIKE: PRLRT(t) |
|---|---|---|---|---|---|---|---|---|
| n=60, p=5 | Type I | 4 | 0.04 | 0.05 | 0.05 | 0.04 | 0.05 | 0.05 |
| n=60, p=5 | Power | 5 | 0.99 | 0.99 | 1 | 0.99 | 0.99 | 1 |
| n=60, p=200 | Type I | 4 | 0.03 | 0.04 | 0.04 | 0.03 | 0.04 | 0.04 |
| n=60, p=200 | Power | 5 | 0.84 | 0.84 | 0.84 | 0.84 | 0.85 | 0.85 |
Table 5.
Estimated type I error rate and power based on profile restricted likelihood ratio test (PRLRT) and permutation test (PERM). LLIKE=LLG’s likelihood-based approach with kernel K; PLIKE=our profile likelihood-based approach with kernel K; P1K=Linear kernel; PRLRT(e) and PRLRT(t) are profile restricted likelihood ratio tests which are performed using empirical distribution and theoretical distribution, respectively.
| n, p | Measure | Case | LLIKE: PERM | LLIKE: PRLRT(e) | LLIKE: PRLRT(t) | PLIKE: PERM | PLIKE: PRLRT(e) | PLIKE: PRLRT(t) |
|---|---|---|---|---|---|---|---|---|
| n=60, p=5 | Type I | 4 | 0.04 | 0.05 | 0.05 | 0.04 | 0.05 | 0.05 |
| n=60, p=5 | Power | 5 | 0.97 | 1 | 1 | 0.97 | 1 | 1 |
| n=60, p=200 | Type I | 4 | 0.03 | 0.04 | 0.04 | 0.03 | 0.04 | 0.04 |
| n=60, p=200 | Power | 5 | 0.82 | 0.83 | 0.83 | 0.82 | 0.83 | 0.83 |
6. Real data analysis
6.1. Pathway based analysis for type II diabetes
We applied our profile likelihood approach to a microarray expression data set on type II diabetes (Mootha et al., 2003), in which the expression of 22,283 genes was measured in 17 male patients with normal glucose tolerance and 18 male patients with type II diabetes mellitus. Because their approach is based on a normalized Kolmogorov–Smirnov statistic, it cannot model a continuous outcome, such as glucose level, or clinical covariates, such as age. Incorporating such information into the analysis may help detect subtle differences in gene expression profiles more efficiently. We studied a total of 277 pathways consisting of 128 KEGG pathways (http://www.genome.jp/kegg/pathway.html) and 149 curated pathways. The 149 curated pathways were constructed from known biological experiments by Mootha and colleagues.
In our analysis, let Y be the log-transformed glucose level, X the age, and Z the p × n matrix of gene expression levels within each pathway, where n = 35 is the number of subjects and p is the number of genes in a specific pathway, which varied from 4 to 200 across the pathways. Our goal is to identify pathways that affect the glucose level related to diabetes after adjusting for the age effect, and to rank the genes within each significant pathway. To identify significant pathways, we fitted the semiparametric model.
6.2. Identifying significant pathways
We chose a 0.05 cutoff for statistical significance using both the theoretical and empirical distributions of the PRLRT and the percentage described in Section 4.2. We also applied existing multiple comparison methods (Storey, 2002, 2003) to our pathway data, although our pathways are not independent of one another because of shared genes and interactions among pathways. The FDR q-values were between 0.081 and 0.303. Our approach, using both theoretical and empirical distributions, took about 60 min to run on a Mac Pro with two 3.0 GHz Quad-Core Intel Xeon processors and 10 GB of memory. Our code is written in Matlab and is available upon request.
To find pathways shared across the four kernels, we selected the top 50 pathways for each kernel and then examined the pathways common to all four. A total of seven pathways were common to the four kernels, including the Alanine and aspartate metabolism, Oxidative phosphorylation, and RNA polymerase pathways. Since one of them is a subset of another pathway, six pathways are summarized in Table 6.
Table 6.
Pathways significant across all four kernels (Linear, Quadratic, Gaussian, Neural network) using the type II diabetes pathway data; P1K=linear kernel; P2K=quadratic kernel; GK=Gaussian kernel; NNK=neural network kernel; P-values are obtained using both the profile restricted likelihood ratio test (PRLRT) and the permutation test (PERM); PRLRT(t) is the profile restricted likelihood ratio test performed using the theoretical distribution.
| Pathway ID | Pathway name (# of genes) | P1K PERM | P1K PRLRT(t) | P2K PERM | P2K PRLRT(t) | GK PERM | GK PRLRT(t) | NNK PERM | NNK PRLRT(t) |
|---|---|---|---|---|---|---|---|---|---|
| 4 | Alanine and aspartate metabolism (18) | 0.009 | 0.008 | 0.003 | 0.002 | 0.034 | 0.030 | 0.016 | 0.014 |
| 36 | c17_U133_probes (116) | 0.029 | 0.031 | 0.002 | 0.001 | 0.003 | 0.002 | 0.015 | 0.013 |
| 133 | MAP00190_Oxidative_phosphorylation (58) | 0.025 | 0.027 | 0.007 | 0.007 | 0.032 | 0.029 | 0.022 | 0.025 |
| 229 | Oxidative_phosphorylation (113) | 0.023 | 0.021 | 0.008 | 0.009 | 0.035 | 0.039 | 0.024 | 0.022 |
| 209 | MAP03020_RNA_polymerase (21) | 0.034 | 0.031 | 0.027 | 0.026 | 0.040 | 0.041 | 0.025 | 0.026 |
| 254 | RNA polymerase (25) | 0.037 | 0.038 | 0.028 | 0.023 | 0.041 | 0.043 | 0.022 | 0.021 |
Pathway 4 is the Alanine and aspartate metabolism pathway, which has been studied for its association with abnormal hepatocellular function and abnormal fasting glucose levels in type II diabetes (Jiamjarasrangsi et al., 2009). Two of the six pathways, pathways 133 and 229, where all but one gene in pathway 133 belong to pathway 229, are related to oxidative phosphorylation, which is known to be associated with diabetes (Misu et al., 2007; Mootha et al., 2003, 2004). Oxidative phosphorylation is a process of cellular respiration in humans (and in eukaryotes generally). These pathways contain genes coregulated across different tissues and are related to insulin/glucose disposal. They include ATP synthesis, a pathway involved in energy transfer. Two other pathways, pathways 209 and 254, are related to RNA polymerase; all but two genes in pathway 209 are part of pathway 254. Among all the pathways, pathway 36, c17_U133_probes, is the most significant under the Gaussian kernel. This pathway plays a role in cellular behavioral changes (Saxena, 2001) and is also one of the seven pathways common among the four kernels. It contains several genes related to human insulin signaling, e.g. CAP1, MAPP2K6, ARF6, and SGK (Dahlquist et al., 2002). Only one gene in pathway 36 is included in pathway 254. These genes were not significant using single-gene-based analysis. We also note that no gene in the Oxidative phosphorylation pathway was significant using single-gene-based analysis, and only one gene, GAD2, in pathway 4 was significant using single-gene-based analysis.
We compared the top 50 pathways identified by the global test (Goeman et al., 2004) and by GSEA (Subramanian et al., 2005). The global test is based on a random effects model that does not include covariates and uses a linear kernel. GSEA calculates an enrichment score, which is a weighted function of the correlation among genes in a pathway; this enrichment score cannot incorporate covariate information. The global test and GSEA results are summarized in the supplementary materials.
We calculated the proportions of overlap among the global test, GSEA, and our approach. The proportion of overlap between the global test and GSEA was 0.36. The largest proportion between GSEA and our approach was 0.41, obtained with the quadratic kernel. These proportions of overlap among different methods were small, meaning that they detected different pathways. When our approach was used with the four different kernels, the largest overlap was 0.92, between the linear and quadratic kernels. The global test and GSEA found pathways 4, 140, and 229 to be significant, but they could not detect pathways 36, 133, and 254, whereas our approach detected all of them.
For kernel selection, we used the Akaike information criterion (AIC) (Akaike, 1974) and the Bayesian information criterion (BIC) (Schwarz, 1978), where AIC = n log{(Y−Ŷ)T(Y−Ŷ)} + 2r, BIC = n log{(Y−Ŷ)T(Y−Ŷ)} + r log(n), Ŷ = LY, L = (I+λ−1K)−1[λ−1K + X{XT(I+λ−1K)−1X}−1XT(I+λ−1K)−1], and r = rank(L).
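These quantities can be computed directly from K, X, Y, and λ. The sketch below (our own Python with our own variable names, mirroring the formulas above, not the authors' Matlab code) does so; a useful sanity check is that the smoother matrix L reproduces the covariates exactly, LX = X:

```python
import numpy as np

def smoother_matrix(X, K, lam):
    # L = (I + lam^{-1}K)^{-1} [lam^{-1}K
    #      + X {X'(I + lam^{-1}K)^{-1} X}^{-1} X'(I + lam^{-1}K)^{-1}]
    n = X.shape[0]
    A = np.linalg.inv(np.eye(n) + K / lam)
    M = np.linalg.inv(X.T @ A @ X)
    return A @ (K / lam + X @ M @ X.T @ A)

def aic_bic(Y, X, K, lam):
    # AIC = n log(RSS) + 2r and BIC = n log(RSS) + r log(n), r = rank(L)
    n = len(Y)
    L = smoother_matrix(X, K, lam)
    r = np.linalg.matrix_rank(L)
    resid = Y - L @ Y
    rss = float(resid @ resid)
    return n * np.log(rss) + 2 * r, n * np.log(rss) + r * np.log(n)

# toy data: n = 35 subjects, intercept plus one covariate, linear kernel
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(35), rng.standard_normal(35)])
Z = rng.standard_normal((35, 5))
K = Z @ Z.T
Y = rng.standard_normal(35)
aic, bic = aic_bic(Y, X, K, lam=2.0)
```

The identity LX = X follows algebraically from the definition of L, since the inner {X′(I+λ⁻¹K)⁻¹X}⁻¹ term makes the fixed-effect part a projection.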
We found that all seven pathways are in the top-15 list of smallest AIC and BIC values for every kernel. Pathway 36, c17_U133_probes, has the smallest AIC and BIC values with the Gaussian kernel among the four kernels, whereas pathway 229, Oxidative phosphorylation, has the smallest values with the quadratic kernel, and pathway 254, RNA polymerase, has the smallest values with the neural network kernel. Pathways 4, 36, and 229 have small values among all pathways with the quadratic kernel.
7. Discussion
In this paper, we have derived the asymptotic properties of semiparametric regression for identifying pathway effects on clinical outcomes after controlling for covariates. We compared LLG’s REML with our REML obtained from a profile likelihood. We showed that the estimators obtained from both REMLs are consistent, have a √n convergence rate, and asymptotically follow a normal distribution when the true parameters are interior points of the parameter space, and a mixture of normal distributions when one component of the true parameters is a left endpoint of the parameter space. However, our REML gives more accurate score equations and information matrix, and has smaller MSEs than LLG’s REML. A profile restricted likelihood ratio test is also provided for the non-standard testing problem in our application.
The choice of an appropriate kernel poses a significant practical issue. We used kernel selection approaches based on AIC and BIC (Liu et al., 2007). Kim et al. (2012) proposed a Bayes factor-based approach to kernel selection; however, it is computationally expensive. Kernel selection can be viewed as a model selection problem within the kernel machine framework. More general and flexible model selection methods for covariance matrix estimation may be explored here and are worth future research.
We note that we analyze each pathway separately. Pathways are known not to be independent of each other because of shared genes and interactions among pathways, which makes it difficult to adjust p-values. Because existing multiple comparison methods based on false discovery rates (Benjamini and Hochberg, 1995; Storey, 2002, 2003) were developed for single-gene-based analysis under an independence assumption or a known positive dependence structure among genes, they are not applicable in a pathway-based analysis, where pathways are not independent of each other. Developing such a multiple comparison method for pathway-based analysis will be a challenging problem because of the complex dependence structure among pathways.
It is also important to generalize the semiparametric model (Hastie and Tibshirani, 1990; Tibshirani, 1996) to incorporate multiple pathways, for example using generalized additive models and multivariate adaptive regression splines (Friedman, 1991).
Supplementary Material
Table 4.
Estimated type I error rate and power based on profile restricted likelihood ratio test (PRLRT) and permutation test (PERM). LLIKE=LLG’s likelihood-based approach with kernel K; PLIKE=our profile likelihood-based approach with kernel K; P2K=Quadratic kernel; PRLRT(e) and PRLRT(t) are profile restricted likelihood ratio tests which are performed using empirical distribution and theoretical distribution, respectively.
| Sample size and number of genes | Type I error / power | Case | LLIKE PERM | LLIKE PRLRT(e) | LLIKE PRLRT(t) | PLIKE PERM | PLIKE PRLRT(e) | PLIKE PRLRT(t) |
|---|---|---|---|---|---|---|---|---|
| n = 60 | Type I | 4 | 0.04 | 0.05 | 0.05 | 0.04 | 0.05 | 0.05 |
| p = 5 | Power | 5 | 0.97 | 1 | 1 | 0.98 | 1 | 1 |
| n = 60 | Type I | 4 | 0.03 | 0.04 | 0.04 | 0.03 | 0.04 | 0.04 |
| p = 200 | Power | 5 | 0.83 | 0.85 | 0.85 | 0.84 | 0.85 | 0.85 |
Acknowledgments
This study was supported in part by NIH Grants GM-59507, N01-HV-28186 and P30-DA-18343, National Science Foundation Grant DMS 1106738, and a pilot grant from the Yale Pepper Center P30AG021342.
Appendix A. Technical complements
A.1. Score equations and information matrix
We introduce the following notation to simplify our discussion:
where Kρ(·, ·) and Kρρ(·, ·) are the first and second derivatives of K(·, ·) with respect to ρ. Let K, Kρ, and Kρρ be the n × n matrices whose entries are K(·, ·), Kρ(·, ·), and Kρρ(·, ·), respectively. We recall the following notation:
Based on these notations and the calculations of
we obtain the first and second derivatives of Σ and Σ−1 with respect to τ as follows: Στ = K, Σττ = 0 × I, ∂Σ−1/∂τ = −Σ−1KΣ−1, and ∂2Σ−1/∂τ2 = 2Σ−1KΣ−1KΣ−1.
We also calculate the first and second derivatives of Σ and Σ−1 with respect to ρ as follows: Σρ = τKρ, Σρρ = τKρρ, ∂Σ−1/∂ρ = −τΣ−1KρΣ−1, and ∂2Σ−1/∂ρ2 = 2τ2Σ−1KρΣ−1KρΣ−1 − τΣ−1KρρΣ−1.
The first and second derivatives of Σ and Σ−1 with respect to σ2 are Σσ2 = I, Σσ2σ2 = 0 × I, ∂Σ−1/∂σ2 = −Σ−2, and ∂2Σ−1/∂(σ2)2 = 2Σ−3.
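These derivative formulas all rest on the standard matrix identity ∂Σ−1/∂θ = −Σ−1Σθ Σ−1. As a sanity check, the identity for τ can be compared against a central finite difference (a self-contained numerical sketch with arbitrary test matrices, not part of the original derivation):

```python
import numpy as np

# Finite-difference check of d(Sigma^{-1})/d tau = -Sigma^{-1} K Sigma^{-1}
# for Sigma = tau*K + sigma2*I (arbitrary test values, illustration only)
rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
K = B @ B.T                                   # symmetric PSD test "kernel"
tau, sigma2, eps = 0.7, 0.5, 1e-6

def sigma_inv(t):
    return np.linalg.inv(t * K + sigma2 * np.eye(6))

numeric = (sigma_inv(tau + eps) - sigma_inv(tau - eps)) / (2 * eps)
analytic = -sigma_inv(tau) @ K @ sigma_inv(tau)
```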
We obtain the second derivatives of Σ and Σ−1 with respect to τ and ρ, τ and σ2, and ρ and σ2 as follows:
The first derivatives of XTΣX with respect to τ, ρ, σ2 are
The second derivatives of (XΣX)−1 with respect to τ, ρ, σ2 are
The first and second derivatives of H with respect to τ are
The first and second derivatives of H with respect to ρ are
The first and second derivatives of H with respect to σ2 are
The second derivatives of H with respect to τ and ρ, τ and σ2, ρ and σ2 are
The first derivatives of P with respect to τ, ρ, and σ2 follow from the identity ∂P/∂θ = −PΣθP, namely Pτ = −PKP, Pρ = −τPKρP, and Pσ2 = −P2.
The score equations based on our REML l2(Θ) are
The second derivatives of l2(Θ) with respect to τ, ρ, and σ2 are
The information matrix based on our REML l2(Θ) is
Also the score equations and information matrix of Liu et al. (2007) are
and
A.2. Regularity conditions
RC1
The observations D = (X, Z, Y) have probability density f(D, Θ) with respect to some measure μ and parameter Θ = (θ1, …, θk). f(D, Θ) has a common support and the model is identifiable. Furthermore, the first and second derivatives of log(f) satisfy EΘ[∂ log f(D, Θ)/∂θj] = 0 for j = 1, …, k,
and EΘ[∂2 log f(D, Θ)/∂θj∂θl] = −EΘ[{∂ log f(D, Θ)/∂θj}{∂ log f(D, Θ)/∂θl}] for j, l = 1, …, k.
RC2
The Fisher information matrix I(Θ) = EΘ[{∂ log f(D, Θ)/∂Θ}{∂ log f(D, Θ)/∂Θ}T]
is finite and positive definite at Θ = Θ0, where Θ0 is the true parameter value.
RC3
There exists an open subset ω of Ω containing the true parameter vector Θ0 such that for almost all D the density f(D, Θ) admits all third derivatives ∂3 log f(D, Θ)/∂θi∂θj∂θl for Θ ∈ ω.
Furthermore, there exist functions Mijl such that |∂3 log f(D, Θ)/∂θi∂θj∂θl| ≤ Mijl(D) for all Θ ∈ ω,
where mijl = EΘ0[Mijl(D)] < ∞ for all i, j, l.
RC4
Near Θ0 the parameter space Ω behaves like a closed set; this condition was used by Self and Liang (1987) and Vu and Zhou (1997).
A.3. Proof of Theorem 1
Recall that LLG’s REML and our REML are denoted as l1(·) and l2(·), and Θ̂1 and Θ̂2 denote LLG’s estimator and our estimator, respectively. In the following derivation, we use l(·) and Θ̂ to denote the loglikelihood and estimator of either approach to simplify discussion. The true parameter value is denoted by Θ0.
Let αn = n−1/2 + an, where
We want to show that for any ε > 0 there exists a constant C such that
P{sup||u||=C l(Θ0 + αnu) < l(Θ0)} ≥ 1 − ε. | (A.1) |
This implies that, with probability at least 1−ε, there exists a local maximum in the ball {Θ0 + αnu: ||u|| ≤ C}. Hence, there exists a local maximizer Θ̂ such that ||Θ̂ − Θ0|| = Op(αn). Since the intersection of the parameter space Ω and the closure of a neighborhood about Θ0 consists of closed intervals, a local maximizer also exists on this set even when Θ0 is on the boundary of the parameter space Ω. Let L′(Θ0) be the gradient vector of the loglikelihood function L. By the standard argument based on a Taylor expansion of the likelihood function, and provided an = o(1), we have
| (A.2) |
Note that n−1/2L′(Θ0) = Op(1). Thus, the first term on the right-hand side of (A.2) is of order Op(n1/2αn) = Op(nαn2). By choosing a sufficiently large C, the second term dominates the first term uniformly in ||u|| = C, and the third term is also dominated by the second term. Hence, by choosing a sufficiently large C, (A.1) holds when Θ0 is in the interior of Ω as well as when Θ0 is on the boundary of Ω. This completes the proof of (1.1) in Theorem 1.
A.4. Proof of Lemma1
Since Σ(Θ) is a symmetric matrix, we can express it as Σ(Θ) = Σi=1n λin(Θ) ein einT,
where λin(Θ) denotes the ordered eigenvalues of Σ(Θ), ein the corresponding orthonormal eigenvectors, and Θ = (θ1, …, θk). For ease of notation, we let λin(Θ) ≡ λin and Σ(Θ) ≡ Σ. We define
Since
we have
Hence, the mth diagonal element of PΣk is
Therefore, we have
Since
are the same as n goes to ∞,
by the law of large numbers.
A.5. Conditions C1–C4
Cressie and Lahiri (1993, 1996) showed that a general result for the asymptotic property of REML estimator Θ̂reml in a parametric linear model with Gaussian error, Y ~ Nn(Xβ, Σ(Θ)), holds under conditions C1–C4, which we describe later in this section, where Y is an n × 1 data vector (Y1, …, Yn)T, X is an n × q matrix of explanatory variables, β is a q × 1 vector of unknown large scale effects, and Σ(Θ) is an n × n positive definite variance matrix which is known up to a k × 1 vector of small scale effects Θ = (θ1, …, θk). In our studies, Σ(Θ) = σ2I+τK is a positive definite matrix. When we use the Gaussian kernel, Θ = (τ, ρ, σ2) and k = 3. For the polynomial and neural network kernels, Θ = (τ, σ2) and k = 2.
Let U = Γ′Y represent a vector of n−s linearly independent error contrasts; that is, the n−s columns of Γ are linearly independent and U ~ N(0, Γ′Σ(Θ)Γ). If a set of n−s linearly independent contrasts is used to define U, the new negative loglikelihood function is, up to an additive constant,
LU(Θ) = (1/2) log|Σ(Θ)| + (1/2) log|XTΣ(Θ)−1X| + (1/2) YTPY,
where P = Σ−1 − Σ−1X(XTΣ−1X)−1XTΣ−1. A REML estimator of Θ can be obtained by minimizing this function. Let φ(Θ) = (∂2LU(Θ)/∂θi∂θj), the k × k matrix of second-order partial derivatives of the negative loglikelihood function LU(·). Then the (i, j)th element of E{φ(Θ)} is tr{PΣi(Θ)PΣj(Θ)}/2. Under conditions C1–C4, Cressie and Lahiri (1993, 1996) showed that
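A numerical sketch of this REML criterion (our own Python, assuming the standard form of the restricted loglikelihood and simulated data, not the authors' code or the diabetes data) evaluates the criterion through the projection matrix P defined above, and checks the key REML property that the criterion is invariant to the fixed-effect part of Y, since PX = 0:

```python
import numpy as np

def neg_reml(Y, X, Sigma):
    # negative REML loglikelihood, up to an additive constant, for
    # Y ~ N(X beta, Sigma), via P = S^{-1} - S^{-1}X(X'S^{-1}X)^{-1}X'S^{-1}
    Si = np.linalg.inv(Sigma)
    P = Si - Si @ X @ np.linalg.inv(X.T @ Si @ X) @ X.T @ Si
    _, logdetS = np.linalg.slogdet(Sigma)
    _, logdetX = np.linalg.slogdet(X.T @ Si @ X)
    return 0.5 * (logdetS + logdetX + Y @ P @ Y)

# grid-search REML estimate of tau with sigma^2 fixed, Sigma = tau*K + I
rng = np.random.default_rng(2)
n = 40
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
Z = rng.standard_normal((n, 5))
K = Z @ Z.T
Y = X @ np.array([1.0, 0.5]) + rng.multivariate_normal(
    np.zeros(n), 2.0 * K + np.eye(n))
grid = np.linspace(0.1, 5.0, 50)
tau_hat = min(grid, key=lambda t: neg_reml(Y, X, t * K + np.eye(n)))
```

The invariance to β is exactly what the error-contrast construction U = Γ′Y guarantees.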
In our semiparametric regression setting, we show that their result holds under certain situations which are described in Lemmas 1 and 2. By using Lemmas 1 and 2 of our paper and the result of Cressie and Lahiri (1993, 1996), there exists a c̄ such that (1/n) tr(PΣi(θ)PΣj(θ)/2)→uc̄, where →u denotes uniform convergence. Therefore, we can show that
The conditions needed for applying the result of Cressie and Lahiri (1993, 1996) are
C1. Σ(Θ) is twice continuously differentiable on Θ.
C2. For all c > 0, η > 0, let ||B|| = {tr(B′B)}1/2, with
C3. There exists a positive definite matrix W(Θ), continuous in Θ, such that Qn(Θ) converges uniformly to W(Θ), where
C4. Let |λ1n(Θ)| ≤ · · · ≤ |λnn(Θ)| denote the ordered absolute eigenvalues of Σ(Θ), and denote those of Σij = ∂2Σ(Θ)/∂Θi∂Θj analogously. Suppose there is a sequence (rn)n≥1 with lim supn→∞ rn/n ≤ 1−δ*, for some δ* ∈ (0, 1), such that, for any compact subset Ωs ⊆ Ω, there exist constants 0 < C1(Ωs) < ∞ and η1(Ωs) > 0 such that
uniformly in Θ ∈ Ωs.
A.6. Proof of Theorem 2
We check the validity of assumptions (C1), (C2), (C3), and (C4) as follows:
Conditions (C1) and (C2.ii) require smoothness of the variance–covariance matrix Σ(Θ) as a function of Θ. Since our Σ(Θ) = τK(ρ) + σ2I is a smooth function of Θ, conditions (C1) and (C2.ii) hold.
Since {EΘ[φn(Θ)]1/2}n≥1 is smooth, which guarantees (C2.i), condition (C2.i) holds by the arguments of Cressie and Lahiri (1996).
C3. We show that condition (C3) holds under the situations described in Lemmas 1 and 2. When we use the Gaussian, polynomial, and neural network kernels, Σ and Στ are simultaneously diagonalizable, as are Σ and Σσ2. For the Gaussian kernel, we also need to show that Σ and Σρ are simultaneously diagonalizable as n goes to ∞. We can show this using Lemma 2 as follows.
The ijth elements of the matrices KKρ and KρK are
Since Ez(KKρ − KρK) = 0 as n→∞, we have KKρ = KρK as n→∞. Therefore, K and Kρ are almost simultaneously diagonalizable for large n, and hence Σ and Σρ are almost simultaneously diagonalizable for large n. From Lemma 1, (1/n) tr(PΣρ) converges. Therefore, condition (C3) holds if Σ and Σk are simultaneously diagonalizable.
C4. Since Σ(Θ) and Σi(Θ) are nonsingular and uniformly bounded over a compact subset of Ω, this condition holds.
This completes the proof of Theorem 2 when Θ0 is an interior point of the parameter space Ω.
Since the intersection of the parameter space Ω and the closure of a neighborhood about Θ0 consists of closed intervals, Theorem 2 also holds when Θ0 is on the boundary of Ω by the arguments of Geyer (1994) and Vu and Zhou (1997), who extended the result of Chernoff (1954) to nonidentically distributed sampling. Geyer’s (1994) result is based on a sampling model that is essentially a stationary process, while Vu and Zhou (1997) have no such restriction and allow general nonidentically distributed sampling, so that models with covariances can be included.
We can demonstrate that our results hold using Vu and Zhou (1997) by relating their conditions to ours. Vu and Zhou (1997) established the existence, consistency, and asymptotic properties of local maximum estimators for a large class of estimation problems that allow sampling from nonidentically distributed random variables, under conditions A1–A2 and B1–B4 of their paper: regularity condition RC3 of our paper implies A1 in Vu and Zhou (1997); RC4 implies A2; RC1 and RC2 imply B1; RC2 implies B2; and condition C2 in A.5 of our paper implies B3 and B4 in Vu and Zhou (1997). Therefore, under the regularity conditions, Theorem 2 also holds when Θ0 is on the boundary of the parameter space Ω.
We can also show our results using Geyer (1994), but we need to add his Assumption D. Geyer (1994) assumed that the sampling distribution is essentially a stationary process and established the existence, consistency, and asymptotic properties of local maximum estimators under Assumptions A–D: since our density function f(D, Θ) is in C3 in Θ, Assumption A in Geyer (1994) is satisfied; RC3 implies Assumption B in Geyer (1994); Assumption C in Geyer (1994) is
for some covariance matrix A, and is satisfied by the weak law of large numbers and the central limit theorem; Assumption D requires that the estimating sequence satisfies Θ̂n = Θ0 + op(1). Therefore, Theorem 2 also holds under the regularity conditions.
A.7. Proof of Theorem 3
For a consistent estimator Θ̂, a first-order Taylor expansion of l(·) near Θ0 yields
where Θ̄ is a vector between Θ̂ and Θ0. Consequently, we have
Next, we show that, (i) the limit distribution of
and (ii)
as n→∞
To prove (i), note that
The first term converges in distribution to N(0, I(Θ0)); the second term goes to zero at rate o(n−1/2) because (XTΣ−1X)−1(XTΣ−1ΣΘΣ−1X) = o(1). Thus, (i) holds.
For (ii), since
and again the second term on the right-hand side goes to zero, while the first term converges in probability to I(Θ0) (Lehmann, 1983), so (ii) holds.
Thus, applying Slutsky’s lemma and converting back to the original parameter space via the delta method, the asymptotic normality holds. This proves (3.1).
The proof of (3.2) follows from case (i) of Chant (1974). Suppose that Ω = Ω1 × Ω2 × Ω3, Θ1 is a left endpoint of Ω1, and the other components of Θ are interior points of Ωi, i = 2, 3. Then Θ̂ can be expressed as
where N is a random variable with a multivariate Gaussian distribution with mean Θ and covariance I−1(Θ0), where Θ is restricted to lie in CΩ–Θ0 and CΩ is a cone with vertex at Θ0. This expression corresponds to the result of Chant’s (1974) case (i).
A.8. Proof of Theorem 4
If Θ̂ is consistent then by a Taylor series approximation
Therefore,
where Θ̄ is a vector between Θ̂ and Θ0, li.Θ = ∂li(Θ)/∂ΘT, and li.ΘΘ = ∂li.Θ(Θ)/∂ΘT, i = 1, 2.
By applying the chain rule, we obtain the first and second derivatives of the REML with respect to Θ as follows:
and
Since ||l2.ΘΘ|| > ||l1.ΘΘ||, we obtain the following inequality:
Therefore, E||Θ̂1 − Θ||2 > E||Θ̂2 − Θ||2. That is, the estimators based on our REML have a theoretically smaller mean squared error than those based on LLG’s REML. This completes the proof of Theorem 4.
A.9. Non-identifiability between τ and ρ when ρ→0
If τ ~ O(1/ρm) for any positive value m and ρ ~ O{E(||z−z′||2)}, then Θ̂2 is asymptotically normally distributed with mean Θ and covariance matrix I−1(Θ). But if ρ→0, so that Hτ = Hττ = Pτ = Pττ = 0 and Hρ = Hρρ = Pρ = Pρρ = 0, then the asymptotic distributions of τ̂2 and ρ̂2 coincide and are degenerate.
For LLG’s estimator Θ̂1, as ρ→0, the asymptotic distributions of τ̂1 and ρ̂1 are also the same because the information matrix of Liu et al. (2007) is
and Σθl → 0.
A.10. Proof of the equivalence of the two tests
The test of H0: {r(Z) is a point mass at zero} ∪ {r(Z) has a constant covariance matrix as a function of z} is equivalent to the test of ∂K(Z)/∂Z = 0.
If ρ→0 and τ→0 at the faster rate O(ρm), then ∂K(Z)/∂Z = 0. That is, if τ/ρ→0, then ∂K(Z)/∂Z = 0.
If ρ→∞, then 0 ≤ exp(−||zi−zj||2/ρ) ≤ 1. Therefore,
Hence, if
If ρ→0 and τ ~ O(1/ρm), then exp(−||zi−zj||2/ρ)→0, and therefore ∂K(Z)/∂Z = 0. In summary, if τ/ρ→0 or ρ→0, then ∂K(Z)/∂Z = 0.
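The endpoint behavior used in this argument can be illustrated numerically for the Gaussian kernel: as ρ→∞ the Gram matrix degenerates to the all-ones matrix (confounded with an intercept), and as ρ→0 it degenerates to the identity (confounded with σ2I). This is our own illustrative sketch with simulated points:

```python
import numpy as np

# Endpoint behavior of the Gaussian kernel K_ij = exp(-||z_i - z_j||^2 / rho)
rng = np.random.default_rng(3)
Z = rng.standard_normal((8, 4))
sq = np.sum(Z**2, axis=1)
d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * Z @ Z.T, 0.0)

K_big = np.exp(-d2 / 1e8)      # rho -> infinity: tends to the all-ones matrix
K_small = np.exp(-d2 / 1e-8)   # rho -> 0: tends to the identity matrix
```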
Appendix B. Supplementary material
Some additional results and information on pathway and genes are available in a separate file for supplementary materials.
Supplementary data associated with this paper can be found in the online version at http://dx.doi.org/10.1016/j.jspi.2012.09.009.
References
- Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control. 1974;19:716–723.
- Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B. 1995;57:289–300.
- Chant D. On asymptotic tests of composite hypotheses in nonstandard conditions. Biometrika. 1974;61:291–298.
- Chernoff H. On the distribution of the likelihood ratio. Annals of Mathematical Statistics. 1954;25:573–578.
- Claeskens G. Restricted likelihood ratio lack of fit tests using mixed spline models. Journal of the Royal Statistical Society, Series B. 2004;66:909–926.
- Cressie N, Lahiri SN. The asymptotic distribution of REML estimators. Journal of Multivariate Analysis. 1993;45:217–233.
- Cressie N, Lahiri SN. Asymptotics for REML estimation of spatial covariance parameters. Journal of Statistical Planning and Inference. 1996;50:327–341.
- Cristianini N, Shawe-Taylor J. Kernel Methods for Pattern Analysis. Cambridge University Press; Cambridge: 2006.
- Dahlquist KD, Salomonis N, Vranizan K, Lawlor SC, Conklin BR. GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nature Genetics. 2002;31:19–20. doi: 10.1038/ng0502-19.
- Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Annals of Statistics. 2004;32:407–499.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Friedman JH. Multivariate adaptive regression splines (with discussion). The Annals of Statistics. 1991;19:1–141.
- Geyer CJ. On the asymptotics of constrained M-estimation. The Annals of Statistics. 1994;22:1993–2010.
- Goeman J, van de Geer SA, Kort F, Houwelingen HC. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics. 2004;20:93–99. doi: 10.1093/bioinformatics/btg382.
- Gu C, Ma P. Optimal smoothing in nonparametric mixed-effect models. Annals of Statistics. 2005;33:1357–1379.
- Harris MA, et al. The gene ontology (GO) database and informatics resource. Nucleic Acids Research. 2004;32:D258–D261. doi: 10.1093/nar/gkh036.
- Hastie TJ, Tibshirani RJ. Generalized Additive Models. Chapman & Hall/CRC; 1990.
- Hosack DA, Dennis G Jr, Sherman BT, Clifford H, Lempicki RA. Identifying biological themes within lists of genes with EASE. Genome Biology. 2003;4(10):R70. doi: 10.1186/gb-2003-4-10-r70.
- Jiamjarasrangsi W, Lertmaharit S, Sangwatanaroj S, Lohsoonthorn V. Type 2 diabetes, impaired fasting glucose, and their association with increased hepatic enzyme levels among the employees in a University Hospital in Thailand. Journal of the Medical Association of Thailand. 2009;92:961–968.
- Kim I, Pang H, Zhao H. Bayesian semiparametric regression models for evaluating pathway effects on clinical continuous and binary outcomes. Statistics in Medicine. 2012;31:1633–1651. doi: 10.1002/sim.4493.
- Lehmann EL. Theory of Point Estimation. John Wiley; New York: 1983.
- Liu D, Lin X, Ghosh D. Semiparametric regression of multi-dimensional genetic pathway data: least squares kernel machines and linear mixed models. Biometrics. 2007;63:1079–1088. doi: 10.1111/j.1541-0420.2007.00799.x.
- Misu H, Takamura T, Matsuzawa N, Shimizu A, Ota T, Sakurai M, Ando H, Arai K, Yamashita T, Honda M, Yamashita T, Kaneko S. Genes involved in oxidative phosphorylation are coordinately upregulated with fasting hyperglycaemia in livers of patients with type 2 diabetes. Diabetologia. 2007;50:268–277. doi: 10.1007/s00125-006-0489-8.
- Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson P, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop L. PGC-1 alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics. 2003;34:267–273. doi: 10.1038/ng1180.
- Mootha VK, Handschin C, Arlow D, Xie X, Pierre JS, Sihag S, Yang W, Altshuler D, Puigserver P, Patterson N, Willy PJ, Schulman IG, Heyman RA, Lander ES, Spiegelman BM. Errα and Gabpa/b specify PGC-1α-dependent oxidative phosphorylation gene expression that is altered in diabetic muscle. Proceedings of the National Academy of Sciences. 2004;101:6570–6575. doi: 10.1073/pnas.0401401101.
- Rajagopalan DA, Agarwal P. Inferring pathways from gene lists using a literature-derived network of biological relationships. Bioinformatics. 2005;21:788–793. doi: 10.1093/bioinformatics/bti069.
- Saxena V. Genomic Response, Bioinformatics, and Mechanics of the Effects of Forces on Tissues and Wound Healing. PhD Dissertation, Department of Mechanical Engineering, Massachusetts Institute of Technology; 2001.
- Schwarz GE. Estimating the dimension of a model. The Annals of Statistics. 1978;6:461–464.
- Self SG, Liang KY. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. Journal of the American Statistical Association. 1987;82:605–610.
- Storey JD. A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B. 2002;64:479–498.
- Storey JD. The positive false discovery rate: a Bayesian interpretation and the q-value. Annals of Statistics. 2003;31:2013–2035.
- Subramanian A, Tamayo P, Mootha V, Mukherjee S, Ebert B, Gillette M, Paulovich A, Pomeroy S, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences. 2005;102(43):15545–15550. doi: 10.1073/pnas.0506580102.
- Tibshirani R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
- Vu HTV, Zhou S. Generalization of likelihood ratio tests under nonstandard conditions. The Annals of Statistics. 1997;25:916–987.
- Wang Y. Mixed-effects smoothing spline ANOVA. Journal of the Royal Statistical Society, Series B. 1998;60:159–174.
- Zhang D, Lin X, Raz J, Sowers M. Semi-parametric mixed models for longitudinal data. Journal of the American Statistical Association. 1998;93:710–719.
- Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B. 2005;67:301–320.