Variable Selection in Nonparametric Varying-Coefficient Models for Analysis of Repeated Measurements

Lifeng Wang; Hongzhe Li; Jianhua Z Huang

doi:10.1198/016214508000000788

. Author manuscript; available in PMC: 2010 Jan 5.

Published in final edited form as: J Am Stat Assoc. 2008 Dec 1;103(484):1556–1569. doi: 10.1198/016214508000000788

Variable Selection in Nonparametric Varying-Coefficient Models for Analysis of Repeated Measurements

Lifeng Wang ¹, Hongzhe Li ², Jianhua Z Huang ³

PMCID: PMC2801925 NIHMSID: NIHMS106173 PMID: 20054431

SUMMARY

Nonparametric varying-coefficient models are commonly used for analysis of data measured repeatedly over time, including longitudinal and functional responses data. While many procedures have been developed for estimating the varying-coefficients, the problem of variable selection for such models has not been addressed. In this article, we present a regularized estimation procedure for variable selection that combines basis function approximations and the smoothly clipped absolute deviation (SCAD) penalty. The proposed procedure simultaneously selects significant variables with time-varying effects and estimates the nonzero smooth coefficient functions. Under suitable conditions, we have established the theoretical properties of our procedure, including consistency in variable selection and the oracle property in estimation. Here the oracle property means that the asymptotic distribution of an estimated coefficient function is the same as that when it is known a priori which variables are in the model. The method is illustrated with simulations and two real data examples, one for identifying risk factors in the study of AIDS and one using microarray time-course gene expression data to identify the transcription factors related to the yeast cell cycle process.

key words and phrases : Functional response, Longitudinal data, Nonparametric function estimation, Regularized estimation, Oracle property

1 Introduction

Varying-coefficient (VC) models (Hastie and Tibshirani, 1993) are commonly used for studying the time-dependent effects of covariates on responses measured repeatedly. Such models can be used for longitudinal data where subjects are often measured repeatedly over a given period of time, so that the measurements within each subject are possibly correlated with each other (Diggle et al., 1994; Rice, 2004). Another setting is the functional response models (Rice, 2004), where the ith response is a smooth real function y_i(t), i = 1,…, n, t ∈ Inline graphic = [0, T], although in practice only y_i(t_ij), j = 1,…, J_i are observed. For both settings, the response Y (t) is a random process and the predictor X(t) = (X⁽¹⁾(t),…,X⁽^p⁾(t))^T is a p-dimensional random process. In applications, observations for n randomly selected subjects are obtained as (y_i(t_ij); x_i(t_ij)) for the ith subject at discrete time point t_ij, i = 1,…,n and j = 1,…,J_i. The linear varying-coefficient model can be written as

y_{i} (t_{i j}) = x_{i} {(t_{i j})}^{T} β (t_{i j}) + ε_{i} (t_{i j}),

(1)

where β(t) = (β₁(t),…,β_p(t))^T is a p-dimensional vector of smooth functions of t, and ε_i(t), i = 1,…,n are independently identically distributed random processes, independent of x_i(t). This model has the simplicity of linear models but also has the flexibility of allowing time-varying covariate effects. Since its introduction to the longitudinal data setting by Hoover, Rice, Wu and Yang (1998), many methods have been developed for estimation and inference of Model (1). See, for example, Wu, Chiang and Hoover (1998) and Fan and Zhang (2000) for the local polynomial kernel method; Huang, Wu and Zhou (2002, 2004) and Qu and Li (2006) for basis expansion and the spline methods, and Chiang, Rice and Wu (2001) for the smoothing spline method. One important problem not well studied in the literature is how to select significant variables in Model (1). The goal of this paper is to estimate β(t) nonparametrically and select relevant predictors x_k(t) with non-zero functional coefficient β_k(t), based on observations {(y_i(t_ij), x_i(t_ij)); i = 1,…,n, j = 1,…,J_i}.

Traditional methods for variable selection include hypothesis testing and using information criteria such as AIC and BIC. Recently, regularized estimation has received much attention as a way of performing variable selection for parametric regression models (see Bickel and Li, 2006 for a review). Existing regularization procedures for variable selection include LASSO (Tibshirani, 1996), smoothly clipped absolute deviation (SCAD) (Fan and Li, 2001) and their extensions (Yuan and Lin, 2006; Zou and Hastie, 2005; Zou, 2006). However, these regularized estimation procedures were developed for regression models where parameters are Euclidean and cannot be applied directly to the nonparametric varying-coefficients models where parameters are nonparametric smooth functions. Moreover, these procedures are studied mainly for independent data; an exception is Fan and Li (2004) where the SCAD penalty was used for variable selection in longitudinal data analysis. On the other hand, some recent results on regularization methods with diverging number of parameters (e.g., Fan & Peng, 2004; Huang, Horowitz & Ma, 2007) are related to nonparametric settings, but the model error due to function approximation has not been of concern in that literature.

Regularized estimation for selecting relevant variables in nonparametric settings has been developed in the general context of Smoothing Spline Analysis of Variance (ANOVA) Models. Lin and Zhang (2006) developed COSSO for component selection and smoothing in Smoothing Spline ANOVA. The COSSO method was extended by Zhang and Lin (2006) to nonparametric regression in exponential families, by Zhang (2006) to Support Vector Machines and by Leng and Zhang (2007) to hazard regression. Lin and Zhang (2006) studied rates of convergence of COSSO estimators; the rest of the papers focused on computational algorithms and did not provide theoretical results. Asymptotic distribution results have not been developed for COSSO and its extensions. Moreover, these approaches have focused on independent data, not on longitudinal data that we study in this paper.

In this paper we use the method of basis expansion for estimating smooth functions in varying coefficient models because of the simplicity of the basis expansion approach as illustrated in Huang et al. (2002). We extend the application of the SCAD penalty to a nonparametric setting. In addition to developing a computational algorithm, we study the asymptotic property of the estimator. We show that our procedure is consistent in variable selection, that is, the probability that it correctly selects the true model tends to one. Furthermore, we show that our procedure has an oracle property, that is, the asymptotic distribution of an estimated coefficient function is the same as that when it is known a priori which variables are in the model. Such results are new in nonparametric settings.

The rest of the paper is organized as follows. We first describe in Section 2 the regularized estimation procedure using basis expansion and the SCAD penalty. A computational algorithm and method of selecting the tuning parameter are given in Section 3. We then present theoretical results in Section 4, including the consistency in variable selection and the oracle property. Some simulation results are shown in Section 5. Section 6 illustrates the proposed method using two real data examples, a CD4 dataset and a microarray time-course gene expression dataset. Technical proofs are gathered in the Appendix.

2 Basis Function Expansion and Regularized Estimation Using the SCAD Penalty

Huang, Wu and Zhou (2002) proposed to estimate the unknown time-varying coefficient functions using basis expansion. Suppose that the coefficient β_k(t) can be approximated by a basis expansion $β_{k} (t) \approx \sum_{l = 1}^{L_{k}} γ_{k l} B_{k l} (t)$ , where L_k is the number of basis functions in approximating the function β_k(t). Model (1) becomes

y_{i} (t_{i j}) \approx \sum_{k = 1}^{p} \sum_{l = 1}^{L_{k}} γ_{k l} x_{i}^{(k)} (t_{i j}) B_{k l} (t_{i j}) + ε_{i} (t_{i j}) .

(2)

The parameters γ_kl in the basis expansion can be estimated by minimizing

\frac{1}{n} \sum_{i = 1}^{n} w_{i} \sum_{j = 1}^{J_{i}} {y_{i} (t_{i j}) - \sum_{k = 1}^{p} \sum_{l = 1}^{L_{k}} γ_{k l} x_{i}^{(k)} (t_{i j}) B_{k l} (t_{i j})}^{2},

(3)

where w_i are weights, taking the value 1 if we treat all observations equally or 1/n_i if we treat each subject equally. An estimate of β_k(t) is obtained by ${\hat{β}}_{k} (t) \approx \sum_{l = 1}^{L_{k}} {\hat{γ}}_{k l} B_{k l} (t)$ where γ̂_kl are minimizers of (3). Various basis systems such as Fourier bases, polynomial bases and B-spline bases can be used in the basis expansion. Huang, Wu and Zhou (2002) studied consistency and rates of convergence of such estimators for general basis choices and Huang, Wu and Zhou (2004) studied asymptotic normality of the estimators when the basis functions B_kl are splines.

Now suppose some variables are not relevant in the regression so that the corresponding coefficient functions are zero functions. We introduce a regularization penalty to (3) so that these zero coefficient functions will be estimated as identically zero. To this end, it is convenient to rewrite (3) using function space notation. Let Inline graphic denote all functions that have the form $\sum_{l = 1}^{L_{k}} γ_{k l} B_{k l} (t)$ for a given basis system {B_kl(t)}. Then (3) can be written as (1/n) $\sum_{i} w_{i} \sum_{j} {y_{i} (t_{i j}) - \sum_{k} g_{k} (t_{i j}) x_{i}^{(k)} (t_{i j})}^{2}$ where $g_{k} (t) = \sum_{l = 1}^{L_{k}} γ_{k l} B_{k l} (t) \in G_{k}$ . Let p_λ(u), u ≥0, be a nonnegative penalty function that depends on a penalty parameter λ. It is assumed that p_λ(0) = 0 and p_λ(u) is nondecreasing as a function of u. Let ||g_k|| denote the L₂-norm of the function g_k. We minimize over g_k ∈ Inline graphic the penalized criterion

\frac{1}{n} \sum_{i = 1}^{n} w_{i} \sum_{j = 1}^{J_{i}} {y_{i} (t_{i j}) - \sum_{k = 1}^{p} g_{k} (t_{i j}) x_{i}^{(k)} (t_{i j})}^{2} + \sum_{k = 1}^{p} p_{λ} (∥ g_{k} ∥) .

(4)

There are two sets of tuning parameters: the L_k’s control the smoothness of the coefficient functions and the λ governs variable selection or sparsity of the model. Selection of these parameters will be discussed in Section 4. In our implementation of the method, we use B-splines as the basis functions. Thus L_k = n_k + d + 1 where n_k is the number of interior knots for g_k and d is the degree of the spline. The interior knots of the splines can be either equally spaced or placed on the sample quantiles of the data so that there are about the same number of observations between any two adjacent knots. We use equally spaced knots for all numerical examples in this paper.

There are many ways to specify the penalty function p_λ(·). If p_λ(u) = λu, then the penalty term in (4) is similar to that in COSSO. However, we will use the SCAD penalty function by Fan and Li (2001), defined as

p_{λ} (u) = {\begin{array}{l} λ u & if 0 \leq u \leq λ, \\ - \frac{(u^{2} - 2 a λ u + λ^{2})}{2 (a - 1)} & if λ < u < a λ, \\ \frac{(a + 1) λ^{2}}{2} & if u \geq a λ . \end{array}

(5)

The penalty function (5) is a quadratic spline function with two knots at λ and aλ, where a is another tuning parameter. Fan and Li (2001) suggested that a = 3.7 is a reasonable choice, which was also adopted in this paper. Use of the SCAD penalty allows us to obtain nice theoretical properties such as consistent variable selection and the oracle property for the proposed method.

For g_k(t) = Σ_l γ_klB_kl(t) ∈ Inline graphic , its squared L₂-norm can be written as a quadratic form in γ_k = (γ_k₁,…,γ_{kL_k})^T. Let R_k = (r_ij) _{L_k× L_k} be a matrix with entries $r_{i j} = \int_{T} B_{k i} (t) B_{k j} (t) d t$ . Then $∥ g_{k} ∣ ∣^{2} = γ_{k}^{T} R_{k} γ_{k} ≜ ∥ γ_{k} ∣ ∣_{k}^{2}$ . Set $γ = {(γ_{1}^{T}, \dots, γ_{p}^{T})}^{T}$ . The penalized weighted least squares criterion (4) can be written as

p l (γ) = \frac{1}{n} \sum_{i = 1}^{n} w_{i} \sum_{j = 1}^{J_{i}} {(y_{i} (t_{i j}) - \sum_{k = 1}^{p} \sum_{l = 1}^{L_{k}} γ_{k l} x_{i}^{(k)} (t_{i j}) B_{k l} (t_{i j})}^{2} + \sum_{k = 1}^{p} p_{λ} (∥ γ_{k} ∣ ∣_{k}) .

(6)

To express the criterion function (6) using vectors and matrices, we define

B (t) = (\begin{matrix} B_{11} (t) & \dots & B_{1 L_{1}} (t) & 0 & \dots & 0 & 0 & \dots & 0 \\ ⋮ & ⋮ & ⋮ & 0 \\ 0 & \dots & 0 & 0 & \dots & 0 & B_{p 1} (t) & \dots & B_{p L_{p}} (t) \end{matrix}),

U_i(t_ij) = (x_i(t_ij)^TB(t_ij))^T, U_i = (U_i(t_i₁),…,U_i(t_{iJ_i}))^T, and $U = {(U_{1}^{T}, \dots, U_{n}^{T})}^{T}$ . We also define y_i = (y_i(t_i₁),…,y_i(t_{iJ_i}))^T and $y = {(y_{1}^{T}, \dots, y_{n}^{T})}^{T}$ . The criterion function (6) can be rewritten as

\begin{array}{l} p l (γ) = \frac{1}{n} \sum_{i = 1}^{n} w_{i} {(y_{i} - U_{i} γ)}^{T} (y_{i} - U_{i} γ) + \sum_{k = 1}^{p} p_{λ} (∥ γ_{k} ∣ ∣_{k}) \\ = {(y - U γ)}^{T} W (y - U γ) + \sum_{k = 1}^{p} p_{λ} (∥ γ_{k} ∣ ∣_{k}), \end{array}

(7)

where W is a diagonal weight matrix with w_i repeated n_i times.

Remark

The penalized weighted least squares criterion here does not take into account the within-subject correlation typically present in the longitudinal data, because the correlation structure is usually unknown a priori. A popular method for dealing with within-subject correlation is to employ a working correlation structure as in the generalized estimation equations (GEE, Zeger and Liang, 1996). The idea of GEE can be easily adapted into our methodolgy. Specifically, the weight matrix W in (7) can be constructed using a working correlation structure, such as AR or compound symmetry. Actually the criteria (4) and (6) correspond to a working independence correlation structure. Using working independence correlation won’t affect the consistency of variable selection (Section 5). On the other hand, using an appropriate working correlation structure may yield more efficient function estimation if the structure is correctly specified.

3 Computational Algorithm

Because of non-differentiability of the penalized loss (7), the commonly-used gradient-based optimization method is not applicable. In this section, we develop an iterative algorithm using local quadratic approximation of the non-convex penalty p_λ(||γ_k||₂). Following Fan and Li (2001), in a neighborhood of a given positive u₀ ∈ ℝ⁺,

p_{λ} (u) \approx p_{λ} (u_{0}) + \frac{1}{2} {p_{λ}^{'} (u_{0}) / u_{0}} (u^{2} - u_{0}^{2}) .

(8)

In our algorithm, a similar quadratic approximation is used by substituting u with ||γ_k||_k in (8), k = 1,…,p. Given an initial value of $γ_{k}^{0}$ with $∥ γ_{k}^{0} ∣ ∣_{k} > 0$ , p_λ(||γ_k||_k) can be approximated by a quadratic form

p_{λ} (∥ γ_{k}^{0} ∣ ∣_{k}) + \frac{1}{2} {p_{λ}^{'} (∥ γ_{k}^{0} ∣ ∣_{k}) / ∥ γ_{k}^{0} ∣ ∣_{k}} {γ_{k}^{T} R_{k} γ_{k} - {(γ_{k}^{0})}^{T} R_{k} γ_{k}^{0}} .

As a consequence, removing an irrelevant constant, the penalized loss (7) becomes

p l (γ) = \frac{1}{n} {(y - U γ)}^{T} W (y - U γ) + \frac{1}{2} γ^{T} Ω_{λ} (γ^{0}) γ,

(9)

where $Ω_{λ} (γ^{0}) = diag {(p_{λ}^{'} (∥ γ_{1}^{0} ∣ ∣_{2}) / ∥ γ_{1}^{0} ∣ ∣_{2}) R_{1}, \dots, (p_{λ}^{'} (∥ γ_{K}^{0} ∣ ∣_{2}) / ∥ γ_{K}^{0} ∣ ∣_{2}) R_{p}}$ where R_k are defined above equation (6). This is a quadratic form whose minimizer satisfies

{U^{T} WU + \frac{n}{2} Ω_{λ} (γ^{0})} γ = U^{T} Wy .

(10)

The above discussion leads to the following algorithm:

Step 1: Initialize γ = γ⁽¹⁾.

Step 2: Given γ⁽^m⁾, update γ to γ⁽^m⁺¹⁾ by solving equation (10) where γ⁰ is set to be γ⁽^m⁾

Step 3: Iterate Step 2 until convergence of γ.

The initial estimation of γ in Step 1 can be obtained using a ridge regression, which substitutes p_λ(||γ_k||₂) in the penalized loss (7) with the quadratic function $∥ γ_{k} ∣ ∣_{2}^{2}$ . The ridge regression has a closed form solution. At any iteration of Step 2, if some $∥ γ_{k}^{(m)} ∣ ∣_{2}$ is smaller than a cutoff value ε > 0, we set γ̂_k = 0 and treat x⁽^k⁾(t) as irrelevant. In our implementation of the algorithm ε is set to 10⁻³.

Remark

Due to the non-convexity of pl(γ), the algorithm sometimes may fail to converge and the search falls into an infinite loop surrounding several local minimizers. In this situation, we choose the local minimizer that gives the smallest value of pl(γ).

4 Selection of Tuning Parameters

To implement the proposed method, we need to choose the tuning parameters: L_k, k = 1,…,p and λ, where L_k controls the smoothness of β̂_k(t), while λ determines the variable selection. In section 5, we will show that our method is consistent in variable selection when these tuning parameters grow or decay at a proper rate with n. In practice, however, we need a data-driven procedure to select the tuning parameters. We propose to use the “leave-one-observation-out” cross-validation (CV) when the errors are independent and use the “leave-one-subject-out” cross-validation for correlated errors.

Below we develop some computational shortcuts using suitable approximations to prevent actual computation of leave-one-out estimates. Although the derivation of model selection criteria generally holds, we focus our presentation on the situation when L_k = L for all β_k(t), k = 1,…, p. To simplify our presentation, we also set W = I; extension to general W is straightforward, simply replacing y with W^1/2y and U with W^1/2U in the derivation below. At the convergence of our algorithm, the nonzero components are estimated as γ̂ = {U^TU + (n/2)Ω_λ(γ̂)}⁻¹U^Ty, which can be considered as the solution of the ridge regression

∥ y - U γ ∣ ∣_{2}^{2} + \frac{n}{2} γ^{T} Ω_{λ} (\hat{γ}) γ .

(11)

The hat matrix of this ridge regression is denoted by H(L, λ) = U{U^TU+(n/2) Ω_λ(γ̂)}⁻¹U^T. The fitted y from this ridge regression satisfies ŷ = H(L, λ)y.

When the errors ε_ij are independent for different t_ij, j = 1,…,J_i, leave-one-observation-out CV is a reasonable choice for selecting tuning parameters. For the ridge regression (11), computation of leave-one-out estimates for CV can be avoided (Hastie and Tibshirani, 1993). The CV criterion has a closed form expression

C V_{1} (L, λ) = \frac{1}{n} \sum_{i = 1}^{n} \sum_{j = 1}^{J_{i}} \frac{{{(y_{i} (t_{i j}) - U_{i}^{T} (t_{i j}) \hat{γ}}}^{2}}{{[1 - {H (L, J)}_{(i j), (i j)}]}^{2}},

(12)

where {H(L; λ)}₍_ij_);(_ij₎ is the diagonal element of H(L, λ) that corresponds to the observation at time t_ij. The corresponding GCV criterion, which replaces the diagonal elements in the above CV formula by their average, is

G C V_{1} (L, λ) = \frac{1}{n} \frac{∥ y - H (L, λ) y ∣ ∣_{2}^{2}}{{[1 - t r {H (L, λ)} / n]}^{2}} .

(13)

When the errors ε_ij are not independent and the correlation structure of ε_i(t) is unknown, the leave-one-observation-out CV is unsuitable, we need to use leave-one-subject-out CV (Rice and Silverman, 1991; Hoover et al., 1998; and Huang et al., 2002). Let γ̂⁽⁻ⁱ⁾ be the solution of (6) after deleting the ith subject. The leave-one-subject-out CV can be written as

C V_{2} (L, λ) = \frac{1}{n} \sum_{i = 1}^{n} \sum_{j = 1}^{J_{i}} {y_{i} (t_{i j}) - U_{i}^{T} (t_{i j}) {\hat{γ}}^{(- i)}}^{2} .

(14)

The CV₂(L, λ) can be viewed as an estimate of the true prediction error. However, computation of CV₂ is intensive, since it requires minimization of (6) n times. To overcome this difficulty, we propose to minimize a delete-one version of (11) instead of (6) in computing CV₂, where γ̂ in Ω_λ(γ̂) of (11) is the estimate based on the whole dataset. The approximated CV error is

A C V_{2} (L, λ) = \frac{1}{n} \sum_{i = 1}^{n} \sum_{j = 1}^{J_{i}} {y_{i} (t_{i j}) - U_{i}^{T} (t_{i j}) {\hat{γ}}^{★ (- i)}}^{2} = \sum_{i = 1}^{n} ∥ y_{i} - U_{i} {\hat{γ}}^{★ (- i)} ∣ ∣_{2}^{2},

(15)

where γ̂^★(−ⁱ⁾ is obtained by solving (11) instead of (6), deleting the ith subject. It turns out a computation shortcut is available for (15) as we discuss next.

The computation shortcut relies on the following “leave-one-subject-out” formula, the proof of which is given in Appendix A.

Lemma 1

Define ${\tilde{y}}^{(i)} = (y_{1}^{T}, \dots, y_{i - 1}^{T}, U_{i} {\hat{γ}}^{★ (- i)}, y_{i + 1}^{T}, \dots, y_{n}^{T})$ , and let γ̃⁽ⁱ⁾ be the minimizer of (11) with y substituted by ỹ⁽ⁱ⁾. Then, U_iγ̂^★(−i) = U_iγ̂⁽ⁱ⁾.

Let A = {U^TU +(n/2)Ω_λ(γ̂)}⁻¹U^T. Partition A = (A₁,…,A_n) into blocks of columns with each block corresponding to one subject. Then

\hat{γ} = Ay = \sum_{i = 1}^{n} A_{i} y_{i}

and

{\tilde{γ}}^{(i)} = A {\tilde{y}}^{(i)} = \sum_{k \neq i} A_{k} y_{k} + A_{i} U_{i} {\hat{γ}}^{★ (- i)} = \hat{γ} - A_{i} y_{i} + A_{i} U_{i} {\hat{γ}}^{★ (- i)} .

As a consequence of Lemma 1,

U_{i} {\hat{γ}}^{★ (- i)} = U_{i} {\tilde{γ}}^{(i)} = U_{i} \hat{γ} - U_{i} A_{i} (y_{i} - U_{i} {\hat{γ}}^{★ (- i)}) .

It follows that

y_{i} - U_{i} {\hat{γ}}^{★ (- i)} = {(I - U_{i} A_{i})}^{- 1} (y_{i} - U_{i} \hat{γ}) = {I - H_{i i} (L, λ)}^{- 1} (y_{i} - U_{i} \hat{γ}) .

Therefore,

A C V_{2} (L, λ) = \frac{1}{n} \sum_{i = 1}^{n} ∥ {I - H_{i i} (L, λ)}^{- 1} (y_{i} - U_{i} \hat{γ}) ∣ ∣_{2}^{2} .

(16)

where H_ii(L, λ) is the (i, i)-th diagonal block corresponding to observations for the i-th subject. Since J_i is usually not a big number, inversion of the J_i-dimensional matrices I − H_ii won’t create a computation burden in the evaluation of (16). Using (16) as an approximation of the leave-one-subject-out CV helps us avoid actual computation of many delete-one estimates. The quality of the approximation is illustrated in Figure 1 using a dataset from the simulation study reported in Section 6.

*Illustration of the approximation of CV scores for one simulated dataset for different values of L and λ. The ACV given in*(16) *approximates the leave-one-subject-out CV in* (14) *well.*

5 Large Sample Properties

For parametric regression models, Fan and Li (2001) established the large sample properties of the SCAD penalized estimates. They showed that the SCAD penalty enables consistent variable selection and has an oracle property for parameter estimation | the estimates of the nonzero regression coefficients behave as if the subset of relevant variables is already known. We show that similar large sample properties hold for our proposed method for the varying coefficients models. Note that some care needs to be taken in developing an asymptotic theory for a nonparametric setting. For example, the rate of convergence of a nonparametric estimate is not root-n. The approximation error due to use of basis expansion needs also to be carefully studied. For simplicity we shall only present our asymptotic results for working independence case. Technical proofs of all asymptotic results are given in the Appendix.

We focus our asymptotic analysis for Inline graphic being a space of spline functions. Extension to general basis expansions can be obtained by combining the technical arguments in this paper with those in Huang et al. (2002). Define L_n = max_1≤_k_≤_p L_k and ρ_n = max_1≤_k_≤_p inf_g_∈ ||β_k − g||L_∞. Thus, ρn characterizes the approximation error due to spline approximation. Assume that only s predictors are relevant in the Model (1). Without loss of generality, let β_k(t), k = 1,…,s, be the non-zero coefficient functions, and β_k(t) ≡ 0, k = s + 1,…,p.

We make the following technical assumptions:

(C1) The response and covariate processes (y_i(t); x_i(t)), i = 1,…, n, are i.i.d. as (y(t), X(t)), and the observation time points t_ij are i.i.d. from an unknown density f (t) on [0, T ], where f(t) are uniformly bounded away from infinity and zero.

(C2) The eigenvalues of the matrix E{X(t)X^T(t)} are uniformly bound away from zero and infinity for all t.

(C3) There exists a positive constant M₁ such that |X_k(t)| ≤ M₁ for all t and 1 ≤ k ≤ p.

(C4) There exists a positive constant M₂ such that E{ε²(t)} ≤ M₂ for all t.

(C5) lim sup_n(max_k L_k/min_k L_k) < ∞.

(C6) The process ε(t) can be decomposed as the sum of two independent stochastic processes, that is, ε(t) = ε⁽¹⁾(t) + ε⁽²⁾(t), where ε⁽¹⁾ is an arbitrary mean zero process and ε⁽²⁾ is a process of measurement errors that are independent at different time points and have mean zero and constant variance σ².

The same set of assumptions has been used in Huang et al. (2004). Our results can be extended to cases with deterministic observation times; the assumption on independent observation times in (C1) can also be relaxed (Remarks 3.1 and 3.2 of Huang et al., 2004).

Define $r_{n} = n^{- 1} {[\sum_{i = 1}^{n} w_{i}^{- 2} {L_{n} J_{i} + J_{i} (J_{i} - 1)}]}^{1 / 2}$ . When w_i = 1/J_i, or w_i = 1/J_i with J_i uniformly bounded, we can show that r_n ≍ (L_n/n)^1/2

Theorem 1

Suppose Conditions (C1)–(C5) hold, lim_n→∞ ρ_n = 0 and

lim_{n \to \infty} n^{- 1} L_{n} log (L_{n}) = 0.

(17)

Then, with a choice of λ_n such that λ_n → 0 and λ_n/max{r_n, ρ_n} → ∞, we have

β̂_k = 0, k = s + 1,…,p, with probability approaching 1.
||β̂_k − β_k||_L₂ = O_p(max(r_n, ρ_n)), k = 1,…,s.

Part (a) of Theorem 1 says that the proposed penalized least squares method is consistent in variable selection, that is, it can identify the zero coefficient functions with probability tending to 1. Part (b) of the the theorem provides the rate of convergence in estimating the non-zero coefficient functions.

Corollary 1

Suppose that Conditions (C1)–(C4) hold and that β_k(t) have bounded second derivatives, k = 1,…,s, and β_k(t) = 0, k = s + 1,…,p. Let Inline graphic be a space of splines with degree no less than 1 and with L_k equally spaced interior knots, where L_k ≍ n^1/5, k = 1,…,p. If λ_n → 0 and n^2/5 λ_n → ∞, then,

β̂_k = 0, k = s + 1,…,p, with probability approaching 1.
||β̂_k − β_k||_L₂ = O_p(n^−2/5), k = 1,…, s.

Note that the rate of convergence given in part (b) of Corollary 1 is the optimal rate for nonparametric regression with independent, identically distributed data under the same smoothness assumption on the unknown function (Stone, 1982).

Now we consider the asymptotic variance of proposed estimate. Let β⁽¹⁾ = (β₁,…, β_s)^T denote the vector of non-zero coefficient functions and let β̂⁽¹⁾ = (β̂₁,…,β̂_s)^T its estimate obtained by minimizing (6). Let ${\hat{γ}}^{(1)} = {(γ_{1}^{T}, \dots, γ_{s}^{T})}^{T}$ . Let U⁽¹⁾ denote the selected columns of U corresponding to β⁽¹⁾ and similarly let $Ω_{λ}^{(1)}$ denote the selected diagonal blocks of Ω_λ. By part (a) of Theorem 1, with probability tending to one, γ̂_k, k = s + 1,…,p, are vectors of 0’s. Thus the quadratic approximation (9) yields

{\hat{γ}}^{(1)} = {{(U^{(1)})}^{T} W U^{(1)} + \frac{n}{2} Ω_{λ_{n}}^{(1)} ({\hat{γ}}^{(1)})}^{- 1} {(U^{(1)})}^{T} Wy .

(18)

Let $U_{i}^{(1)}$ be the rows of U⁽¹⁾ corresponding to subject i. Let

\begin{array}{l} H^{(1)} = {(U^{(1)})}^{T} W U^{(1)} + \frac{n}{2} Ω_{λ}^{(1)} ({\hat{γ}}^{(1)}) \\ = \sum_{i = 1}^{n} {(U_{i}^{(1)})}^{T} W_{i} U_{i}^{(1)} + \frac{n}{2} Ω_{λ}^{(1)} ({\hat{γ}}^{(1)}) . \end{array}

(19)

The asymptotic variance of γ̂⁽¹⁾ is

\begin{array}{l} Avar ({\hat{γ}}^{(1)}) = {(H^{(1)})}^{- 1} {(U^{(1)})}^{T} W cov (y) W U^{(1)} {(H^{(1)})}^{- 1} \\ = {(H^{(1)})}^{- 1} \sum_{i = 1}^{n} {(U_{i}^{(1)})}^{T} W_{i} cov (y_{i}) W_{i} U_{i}^{(1)} {(H^{(1)})}^{- 1}, \end{array}

(20)

whose diagonal blocks give Avar(γ̂_k), k = 1,…,s. Since β̂⁽¹⁾(t) = (B⁽¹⁾)^T(t)γ̂⁽¹⁾ where B⁽¹⁾(t) are the first s rows of B(t), we have that Avar {β̂⁽¹⁾(t)} = (B⁽¹⁾)^T(t)Avar(γ̂⁽¹⁾)B⁽¹⁾(t), the diagonal elements of which are Avar{β̂_k(t)}, k = 1,…,s.

Let β̃_k(t) = E{β̂_k(t)} be the conditional mean given all observed X’s. Let β̃⁽¹⁾(t) = (β̃₁(t),… β̃_s(t)). For positive definite matrices A, let A ^1/2 denote the unique square root of A and let A^−1/2 = (A⁻¹)^1/2. Let Avar^*{β̂⁽¹⁾(t)} denote a modification of Avar{β̂⁽¹⁾(t)} by replacing Ω_{λ_n} in (19) with 0.

Theorem 2

Suppose Conditions (C1)–(C6) hold, lim_{n → ∞} ρ_n = 0, lim_{n→ ∞} L_n max_i J_i/n = 0, and (17) holds. Then, as n → ∞, {Avar^* (β̂⁽¹⁾(t))}^−1/2{β̂⁽¹⁾(t) − β̃⁽¹⁾(t)} → N(0, I) in distribution, and in particular, {Avar^* (β̂_k(t))}^−1/2{β̂_k(t) − β̃_k(t)} → N (0, 1), k = 1,…, s.

Note that Avar^*{β̂⁽¹⁾(t)} is exactly the same asymptotic variance of the non-penalized weighted least squares estimate using only those covariates corresponding to nonzero coefficient functions (Theorem 3, Huang et al., 2004). Theorem 2 implies that our penalized least squares estimate has the so-called oracle property — the joint asymptotic distribution of estimates of nonzero coefficient functions are the same as that of the non-penalized least squares estimate that utilizes the information about zero coefficient functions.

Theorem 2 can be used to construct pointwise asymptotic confidence intervals for β̃_k(t), k = 1,…,s. Theorem 4 and Corollary 1 of Huang et al. (2004) discuss how to bound the biases β̃_k(t) − β_k(t) and under what conditions the biases are negligible relative to variance. Those results are applicable to the current context. When constructing the confidence intervals, in order to improve finite sample performance, we still use (19) and the sandwich formula (20) to calculate the variance. In actual calculation, we need replace cov(y_i) in (19) by its estimate $e_{i} e_{i}^{T}$ where e_i is the residual vector for subject i.

6 Monte Carlo Simulation

We conducted simulation studies to assess the performance of the proposed procedure. In each simulation run, we generated a simple random sample of 200 subjects according to the model used in Huang et al. (2002), which assumes

Y (t_{i j}) = β_{0} (t_{i j}) + \sum_{k = 1}^{23} β_{k} (t_{i j}) x_{k} (t_{i j}) + τ ε (t_{i j}), i = 1, \dots, 200, j = 1, \dots, J_{i} .

The first three variables x_i(t), i = 1,…,3, are the true relevant covariates, which are simulated the same way as in Huang et al. (2002): x₁(t) is sampled uniformly from [t/10, 2+ t/10] at any given time point t; x₂(t), conditioning on x₁(t), is Gaussian with mean zero and variance (1 + x₁(t))/(2 + x₁(t)); x₃(t), independent of x₁ and x₂, is a Bernoulli random variable with success rate 0.6. In addition to x_k, k = 1, 2, 3, 20 redundant variables x_k(t), k = 4,…,23, are simulated to demonstrate the performance of variable selection, where each x_k(t), independent of each other, is a random realization of a Gaussian process with covariance structure cov(x_k(t); x_k(s)) = 4 exp(− |t − s|). The random error ε(t) is given by Z(t) + E(t), where Z(t) has the same distribution as x_k(t), k = 4,…,23, and E(t) are independent measurement errors from N(0, 4) distribution at each time t. The parameter τ in the model is used to control the signal-to-noise ratio of the model. We consider models with τ = 1.00, 4.00, 5.66 and 8.00. The coefficients β_k(t), k = 0,…,3, corresponding to the constant term and the first three variables, are given by

\begin{matrix} β_{0} (t) = 15 + 20 sin (\frac{π t}{60}), β_{1} (t) = 2 - 3 cos (\frac{π (t - 25)}{15}), \\ β_{2} (t) = 6 - 0.2 t, β_{3} (t) = - 4 + \frac{{(20 - t)}^{3}}{2000}, \end{matrix}

(see solid lines of Figure 2) while the remaining coefficients, corresponding to the irrelevant variables, are given by β_k(t) = 0, k = 4,…, 23. The observation time points t_ij are generated following the same scheme as in Huang et al. (2002), where each subject has a set of “scheduled” time points {1,…,30}, and each scheduled time has a probability of 60% of being skipped. Then, the actual observation time t_ij is obtained by adding a random perturbation from Uniform(− 0.5, 0.5) to the nonskipped scheduled time.

Simulation study for τ = 1.00. True (solid lines) and average of estimated (dashed lines) time-varying coefficients β₀(t) (a), β₁(t) (b), β₂(t) (c) and β₃(t) (+/− 1 point-wise SD) over 500 replications.

For each simulated dataset, the penalized least squares criterion (4) was minimized over a space of cubic splines with equally spaced knots. The number of knots and the tuning parameter λ were selected by minimizing the ACV₂ criterion in (15) as described in Section 4 (see Figure 1 for an example). We repeated the simulations 500 times. Table 1 shows the results of variable selection of our proposed procedure for four different models with different levels of noise variances. Clearly the level of noise had a great effect on how well the proposed method selected the exact correct models. When the noise level was low (τ = 1.00), the method selected the exact correct model in 79% of the replications. In addition, out of these 500 replications, the variables 1–3 were selected in each of the runs. In contrast, the number of times each of the 20 irrelevant variables were selected among the 500 simulation runs ranged from 3 to 12 with an average of 7.3 times (i.e., 1.46%). However, when the noise level was high (τ = 8.00), the method selected the exact true model in only 26% of the replications and the average number of times that the three relevant variables were selected was reduced to 2.24. However, it is interesting to note that the average number of times that the irrelevant variables were selected remained to be small (See Aver.I in Table 1). The simulations indicate that the proposed procedure indeed provides an effective method for selecting variables with time-varying coefficients.

Table 1.

Simulation results for models with different noise levels (τ) based on 500 replications. Perc.T: percentage of replications that the exact true model was selected; Aver.R: average of the numbers of the three relevant variables that were selected in the model; Aver.I: average of the numbers of the irrelevant variables that were selected in the model; Aver.Bias²: average of the squared biases; Aver.Var: average of the empirical variances; 95%Cov.Prob: 95% coverage probability based on the replications when the true models were selected. For Aver.Bias², Aver.Var and 95%Cov.Prob, all values were averaged over a grid of 300 points and 500 replications. Numbers in parentheses are results of the oracle estimators that assume the true models are known.

	Noise level (τ)
	1.00	4.00	5.66	8.00
	Variable selection
Perc.T	0.79	0.68	0.48	0.26
Aver.R	3.00	2.94	2.73	2.24
Aver.I	0.29	0.47	0.55	0.54
	Aver.Bias²
β₀(t)	0.021 (0.023)	0.39 (0.37)	1.60 (0.42)	4.55 (0.42)
β₁(t)	0.003 (0.004)	0.067 (0.067)	0.29 (0.076)	0.97 (0.073)
β₂(t)	0.000089 (0.0002)	0.0022 (0.0022)	0.0047 (0.0043)	0.026 (0.0089)
β₃(t)	0.00023 (0.0004)	0.0079 (0.0060)	0.061 (0.010)	0.88 (0.02)
	Aver.Var
β₀(t)	0.98 (0.95)	14.11 (12.53)	28.29 (23.98)	49.09 (47.25)
β₁(t)	0.12 (0.11)	1.95 (1.52)	4.08 (2.84)	7.15 (5.52)
β₂(t)	0.049 (0.047)	0.70 (0.67)	1.37 (1.31)	3.06 (2.63)
β₃(t)	0.15 (0.15)	2.13 (2.01)	4.70 (3.96)	9.98 (7.88)
	95%Cov.Prob
β₀(t)	0.929	0.935	0.937	0.926
β₁(t)	0.924	0.929	0.930	0.929
β₂(t)	0.936	0.940	0.946	0.942
β₃(t)	0.944	0.949	0.947	0.949

Open in a new tab

To examine how well the method estimates the coefficient functions, Figure 2 shows the estimates of the time-varying coefficients of β_k(t), k = 0, 1, 2, 3 for the model with τ = 1.00, indicating that the estimates fit the true function very well. Table 1 shows the averages of the squared biases and the empirical variances of the estimates of β_k(t), k = 0, 1, 2, 3 averaged over a grid of 300 points. In general, the biases were small, except for the estimates of β₀(t) when the noise level was high, in which case the proposed method did not select the variables well. Similarly, the variances of the estimates of β₀(t) were also relatively larger due to the fact that the magnitude of true β₀(t) is larger than the other three functions. As a comparison, we also calculated the biases and the empirical variances of the estimates based on the true models (i.e., the oracle estimators, see Table 1). Again, we observed that the both biases and empirical variances were very comparable to the oracle estimators when the noise level was low. When the noise level was high, due to the fact that the method tends to select fewer relevant variables, the biases and the empirical variances became large.

Finally, we examined the coverage probabilities of the theoretical point-wise confidence intervals based on Theorem 2. For each model, the empirical coverage probabilities averaged over a grid of 300 points were presented in Table 1 for β₀(t), β₁(t), β₂(t) and β₃(t), respectively, for nominal 95% point-wise confidence intervals. The slight under-coverage is caused by bias: the GCV criterion tried to balance the bias and variance. When we used the number of knots larger than that suggested by GCV (under-smoothing), the empirical coverage probabilities got closer to the nominal level. For example, for the model with τ = 1, the average empirical coverage probabilities became 0.946, 0.949, 0.942 and 0.947 when the number of interior knots were fixed at 5.

7 Applications to Real Datasets

To demonstrate the effectiveness of the proposed methods in selecting the variables with time-varying effects and in estimating the time-varying effects, we present in this Section results from the analysis of two real datasets, including a longitudinal AIDS dataset (Kaslow et al., 1987) and a repeated measurement microarray time course gene expression dataset (Spellman et al., 1998).

7.1 Application to AIDS data

Kaslow et al. (1987) reported a Multicenter AIDS Cohort Study to obtain the repeated measurements of physical examinations, laboratory results and CD4 cell counts and percentages of the homosexual men who became HIV-positive during 1984 and 1991. All individuals were scheduled to have their measurements made at semi-annual visits, but because many individuals missed some of their scheduled visits and the HIV infections happened randomly during the study, there were unequal numbers of repeated measurements and different measurement times per individual. As a subset of the cohort, our analysis focuses on the 283 homosexual men who become HIV-positive and aims to evaluate the effects of cigarette smoking, pre-HIV infection CD4 percentage, and age at HIV infection on the mean CD4-percentage after infection. In this subset, the number of repeated measurements per subject ranged from 1 to 14, with a median of 6 and a mean of 6.57. The number of distinct measurement time points was 59. Let t_ij be the time in years of the jth measurement of the ith individual after HIV infection, y_ij be the ith individual’s CD4 percentage at time t_ij, and x_i₁ be the smoking status of the ith individual, taking a value of 1 or 0 for smoker or non-smoker. In addition, let x_i₂ be the centered age at HIV infection for the ith individual and x_i₃ be the centered pre-infection CD4 percentage. We assume the following VC model for y_ij,

y_{i j} = β_{0} (t_{i j}) + \sum_{k = 1}^{3} x_{i k} β_{k} (t_{i k}) + ε_{i j},

where β₀(t) is the baseline CD4 percentage, representing the mean CD4 percentage t years after the infection for a non-smoker with average pre-infection CD4 percentage and average age at HIV infection, and β₁(t), β₂(t), β₃(t) measures the time-varying effects for cigarette smoking, age at HIV infection and pre-infection CD4 percentage, respectively, on the post-infection CD4 percentage at time t.

Our procedure identified two non-zero coefficients, β₀(t) and β₃(t), indicating that cigarette smoking and age of HIV infection do not have any effects on the poster-infection CDS4 percentage. Figure 3 shows the fitted coefficient functions (solid curves) and their 95% point-wise confidence intervals, indicating that the baseline CD4 percentage of the population decreases with time and pre-infection CD4 percentage appears to be positively associated with high post-infection CD4 percentage. These results agreed with those obtained by the local smoothing methods of Wu and Chiang (2002) and Fan and Zhang (2002) and by the basis function approximation method of Huang et al. (2002).

Application to AIDS data. Estimated coefficient functions for intercept (a) and pre-infection CD4 percentages (b) (solid curves) and their point-wise 95% confidence intervals (dashed lines).

To assess our method in more challenging cases, we added 30 redundant variables, which are generated by randomly permuting the values of each predictor variable 10 times, and applied our method to the augmented dataset. We repeated the procedure 50 times. For the three observed covariates, “smoking status” was never selected, “age” was selected twice, and “pre-infection CD4 percentage” was always selected, wheras among the 30 redundant variables, two were selected twice, four were selected once, and the rest were never selected.

7.2 Application to yeast cell cycle gene expression data

The cell cycle is a tightly regulated set of processes by which cells grow, replicate their DNA, segregate their chromosomes, and divide into daughter cells. The cell cycle process is commonly divided into G1-S-G2-M stages, where the G1 stage stands for “GAP 1”, the S stage stands for “Synthesis”, during which DNA replication occurs, the G2 stage stands for “GAP 2” and the M stage stands for “mitosis”, which is when nuclear (chromosomes separate) and cytoplasmic (cytokinesis) division occur. Coordinate expression of genes whose products contribute to stage-specific functions is a key feature of the cell cycle (Simon et al., 2001). Transcription factors (TFs) have been identified that have roles in regulating transcription of a small set of yeast cell-cycle regulated genes; these include Mbp1, Swi4, Swi6, Mcm1, Fkh1, Fkh2, Ndd1, Swi5 and Ace2. Among these, Swi6, Swi4 and Mbp1 have been reported to function during the G1 stage; McM1, Ndd1 and Fkh1/2 have been reported to function during the G2 stage; and Ace2, Swi5 and Mcm1 have been reported to function during the M stage (Simon et al., 2001). However, it is not clear where these TFs regulate all cell cycle genes.

We apply our methods to identify the key transcription factors that play roles in the network of regulations from a set of gene expression measurements captured at different time points during the cell cycle. The dataset we use comes from the α-factor synchronized cultures from Spellman et al. (1998). They monitored genome-wide mRNA levels for 6178 yeast ORFs simultaneously using several different methods of synchronization including an α-factor-mediated G1 arrest, which covers approximately two cell-cycle periods with measurements at 7-min intervals for 119 mins with a total of 18 time points. Using a model-based approach, Luan and Li (2003) identified 297 cell-cycle-regulated genes based on the α-factor synchronization experiments. Let y_it denote the log-expression level for gene i at time point t during the cell cycle process, for i = 1, ···, 297 and t = 1, ···, 18. We then applied the mixture model approach (Chen et al., 2007; Wang et al., 2007) using the ChIP data of Lee et al. (2002) to derive the binding probabilities x_ik for these 297 cell-cycle-regulated genes for a total of 96 transcriptional factors with at least one nonzero binding probability in the 297 genes. We assume the following VC model to link the binding probabilities to the gene expression levels

y_{i t} = β_{0} (t) + \sum_{k = 1}^{96} β_{k} (t) x_{i k} + ε_{i t},

where β_k(t) models the transcription effect of the kth TF on gene expression at time t during the cell cycle process. In this model, we assume that gene expression levels are independent conditioning on effects of the relevant transcription factors. In addition, for a given gene i, ∈_its are also assumed to be independent over different time points due to the cross-sectional nature of the experiments. Our goal is to identify the transcriptional factors that might be related to the expression patterns of these 297 cell-cycle-regulated genes. Since different transcriptional factors may regulate the gene expression at different time points during the cell cycle process, their effects on gene expression are expected to be time-dependent.

We applied our method using the GCV defined in (13) for selecting the tuning parameter in order to identify the TFs that affect the expression changes over time for these 297 cell-cycle-regulated genes in the α-factor synchronization experiment. Our procedure identified a total of 71 TFs that are related to yeast cell-cycle processes, including 19 of the 21 known and experimentally verified cell-cycle related TFs, all showing time-dependent effects of these TFs on gene expression levels. In addition, the effects followed similar trends between the two cell cycle periods. The other two TFs, CBF1 and GCN4, were not selected by our procedure. The minimum p-values over 18 time points from simple linear regressions are 0.06 and 0.14, respectively, also indicating that CBF1 and GCN4 were not related to expression variation over time. The 52 additional TFs that were selected by our procedure almost all showed estimated periodic transcriptional effects. The identified TFs include many pairs of cooperative or synergistic pairs of TFs involved in the yeast cell cycle process reported in the literature (Banerjee and Zhang, 2003; Tsai et al., 2005). Of these 52 TFs, 34 of them belong to the cooperative pairs of the TFs identified by Banerjee and Zhang (2003). Overall, the model can explain 43% of the total variations of the gene expression levels.

Figure 4 shows the estimated time-dependent transcriptional effects of nine of the experimentally verified TFs that are known to regulate cell-cycle-regulated genes at different stages. The top panel shows the transcriptional effects of three TFs, Swi4, Swi6 and Mbp1, which regulate gene expression at the G1 phase (Simon et al., 2001). The estimated transcriptional effects of these three TFs are quite similar with peak effects obtained at approximately the time points corresponding to the G1 phase of the cell cycle process. The middle panel shows the transcriptional effects of three TFs, Mcm1, Fkh2 and Ndd1, which regulate gene expression at the G2 phase (Simon et al., 2001). The estimated transcriptional effects of two of these three TFs, Fkh2 and Ndd1, are similar with peak effects obtained at approximately the time points corresponding to the G2 phase of the cell cycle process. The estimated effect of Mcm1 was somewhat different, however Mcm1 is also known to regulate gene expression at the M phase (Simon et al., 2001). The bottom panel of Figure 4 shows the transcriptional effects of three TFs, Swi5, Ace2 and Mcm1, which regulate gene expression at the M phase (Simon et al., 2001), indicating similar transcriptional effects of these three TFs with peak effects at approximately the time points corresponding to the M phase of the cell cycle. These plots demonstrated that our procedure can indeed identify the known cell-cycle-related TFs and point to the stages when these TFs function to regulate the expressions of cell cycle genes. As a comparison, only two of these nine TFs (SWI4 and FKH2) show certain periodic patterns in the observed gene expressions over the two cell-cycle periods, indicating that by only examining the expression profiles, we may miss many relevant transcription factors. Our method incorporates both time-course gene expression and ChIP-chip binding data and enables us to identify more transcription factors that are related to the cell-cycle process.

Application to yeast cell cycle gene expression data. Estimated time-varying coefficients for selected transcription factors (TFs)(solid lines) and point-wise 90% confidence intervals (dotted lines). Top panel, TFs that regulate genes expressed at the G1 phase; middle panel, TFs that regulate genes expressed at the G2 phase; bottom panel, TFs that regulate genes expressed at the M phase.

Finally, to assess false identifications of the TFs that are related to a dynamic biological procedure, we randomly assigned the observed gene expression profiles to the 297 genes while keeping the values and the orders of the expression values for a given profile unchanged and applied our procedure again to the new datasets. The goal of such permutations was to scramble the response data and to create new expression profiles that do not depend on the motif binding probabilities in order to see to what extent the proposed procedure identifies the spurious variables. We repeated this procedure 50 times. Among the 50 runs, 5 runs selected 4 TFs, 1 run selected 3 TFs, 16 runs selected 2 TFs and the rest of the 28 runs did not select any of the TFs, indicating that our procedure indeed selects the relevant TFs with few false positives.

8 Discussion

We have proposed a regularized estimation procedure for nonparametric varying-coefficient models that can simultaneously perform variable selection and estimation of smooth coefficient functions. The procedure is applicable to both longitudinal and functional responses. Our method extends the SCAD penalty of Fan and Li (2001) from parametric settings to a nonparametric setting. We have shown that the proposed method is consistent in variable selection and has the oracle property that the asymptotic distribution of the nonparametric estimate of a smooth function with variable selection is the same as that when knowing which variables are in the model. This oracle property has been obtained for parametric settings, but our result seems to be the first of its kind in a nonparametric setting. A computation algorithm is developed based on local quadratic approximation to the criterion function. Simulation studies indicated that the proposed procedure is very effective in selecting relevant variables and in estimating the smooth regression coefficient functions. Results from application to the yeast cell cycle dataset indicates that the procedure can be effective in selecting the transcription factors that potentially play important roles in the regulation of gene expressions during the cell cycle process.

We need regularization for two purposes in our method: controlling the smoothness of the function estimates and variable selection. In our approach, the number of basis functions/the number of knots of splines is used to control the smoothness, while the SCAD parameter is used to set the threshold for variable selection. Thus, the two tasks of regularization are disentangled. To achieve finer control of smoothness, however, it is desirable to use the roughness penalty for introducing smoothness in function estimation, as in penalized splines or smoothing splines. Following the idea of COSSO, we could replace Σ_k p_λ(||g_k||) in (4) by λΣ_k ||g_k||_W where ||·||_W denotes a Sobolev norm, and then minimize the modified criterion over the corresponding Sobolev space. However, such an extension of COSSO is not very attractive compared with our method, because a single tuning parameter λ serves two different purposes of regularization. It is an interesting open question to develop a unified penalization method that does both smoothing and variable selection but also has the two tasks decoupled in some manner. Recent results on regularization methods with a diverging number of parameters (Fan and Peng, 2004; Huang et al., 2008) may provide theoretical insights into such a procedure.

As we explained in Section 2, our methodology applies when a working correlation structure is used to take into account within-subject correlation for longitudinal data. We only presented asymptotic results for the working independence case, however. Extension of these results to parametrically estimated correlation structures is relatively straightforward, using the arguments in Huang, Zhang and Zhou (2007); one needs to assume that the ratio of the smallest and largest eigenvalues of the weighting matrix W is bounded. Extension to nonparametrically estimated correlation structures (Wu and Pourahmadi, 2003; Huang et al., 2006; Huang, Liu and Liu, 2007) would be more challenging. Careful study of the asymptotics of our methodology for general working correlation structures is left for future research.

Acknowledgments

The work of the first two authors was supported by NIH grants ES009911, CA127334 and AG025532. Jian-hua Huang’s work was partly supported by grants from the National Science Foundation (DMS-0606580) and the National Cancer Institute (CA57030). We thank two referees for helpful comments and Mr. Edmund Weisberg, MS at Penn CCEB for editorial assistance.

Appendix

A.1 Proof of Lemma 1

Use proof by contradiction. Suppose U_iγ̂^★⁽⁻ⁱ⁾ ≠ U_iγ̃⁽ⁱ⁾. Denote (11) as pl_ridge(γ, y). Then,

\begin{array}{l} {p l}_{ridge} ({\tilde{γ}}^{(i)}, {\tilde{y}}^{(i)}) = \frac{1}{n} \sum_{k = 1}^{n} ∥ {\tilde{y}}_{k}^{(i)} - U_{k} {\tilde{γ}}_{k}^{(i)} ∣ ∣_{2}^{2} + \frac{1}{2} {({\tilde{γ}}^{(i)})}^{T} \sum_{λ} (\hat{γ}) {\tilde{γ}}^{(i)} \\ > \frac{1}{n} \sum_{k \neq i} ∥ y_{k} - U_{k} {\tilde{γ}}_{k}^{(i)} ∣ ∣_{2}^{2} + \frac{1}{2} {({\tilde{γ}}^{(i)})}^{T} \sum_{λ} (\hat{γ}) {\tilde{γ}}^{(i)} \\ \geq \frac{1}{n} \sum_{k \neq i} ∥ y_{k} - U_{k} {\hat{γ}}_{k}^{* (- i)} ∣ ∣_{2}^{2} + \frac{1}{2} {({\hat{γ}}^{* (- i)})}^{T} \sum_{λ} (\hat{γ}) {\hat{γ}}^{* (- i)} \\ = p l_{ridge} ({\hat{γ}}^{* (- i)}, {\tilde{y}}^{(i)}) . \end{array}

This contradicts the fact that γ̃⁽ⁱ⁾ minimizes pl_ridge(γ̃⁽ⁱ⁾, ỹ⁽ⁱ⁾).

A.2 Useful lemmas

This subsection provides some lemmas that facilitate the proof of Theorems 1 and 2. Define ỹ_i = E(y_i|x_i), $\tilde{γ} = {(\sum_{i = 1}^{n} U_{i}^{T} W_{i} U_{i})}^{- 1} (\sum_{i = 1}^{n} U_{i}^{T} W_{i} {\tilde{y}}_{i})$ , and (β̃(t) = B(t)γ̃. Here β̃(t) can be regarded as the conditional mean of β̂(t). We have a bias-variance decomposition β̂−β = (β̂ − β̃) + (β̃ − β). Lemmas 3 and 5 quantify the rate of the convergence of ||β̃ − β||_L₂ and ||β̂−β̃||_L₂, respectively. The rate given in Lemma 5 on the variance term will be refined in part (b) of Theorem 1.

Lemma 2

Suppose that (17) holds. There exists an interval [M₃, M₄], 0 < M₃ < M₄ < ∞, such that all the eigenvalues of $n^{- 1} L_{n} \sum_{i = 1}^{n} U_{i}^{T} W_{i} U_{i}$ fall in [M₃, M₄] with probability approaching 1 as n → ∞.

Lemma 3

Suppose that (17) holds. Then, ||β̃ − β||_L₂ = O_P (ρ_n).

Lemmas 2 and 3 are from Lemmas A.3 and A.7 in Huang et al. (2004), respectively. The proofs are omitted.

Lemma 4

$n^{- 1} L_{n}^{1 / 2} {sup}_{v} (∣ ε^{T} WUv ∣ / ∣ v ∣) = O_{P} (r_{n})$ , where |v| = ||v|| _l₂

Proof of Lemma 4

By the Cauchy-Schwarz inequality,

sup_{v} {{(\frac{∣ ε^{T} WUv ∣}{∣ v ∣})}^{2}} \leq ε^{T} WU U^{T} W ε .

(21)

The expected value of the right hand side equals

\begin{array}{l} E (ε^{T} WU U^{T} W ε) = E {(\sum_{i = 1}^{n} ε_{i}^{T} W_{i} U_{i}) (\sum_{i = 1}^{n} U_{i}^{T} W_{i} ε_{i})} \\ = \sum_{i = 1}^{n} E (ε_{i}^{T} W_{i} U_{i} U_{i}^{T} W_{i} ε_{i}) . \end{array}

(22)

Note that

\begin{array}{l} E (ε_{i}^{T} W_{i} U_{i} U_{i}^{T} W_{i} ε_{i}) \\ = w_{i}^{- 2} E [\sum_{k, l} {\sum_{j} ε_{i} (t_{i j}) X_{i k} (t_{i j}) B_{k l} (t_{i j})}^{2}] \\ = w_{i}^{- 2} \sum_{k, l} [\sum_{j} E {ε_{i} {(t_{i j})}^{2} X_{i k} {(t_{i j})}^{2} B_{k l} {(t_{i j})}^{2}} \\ + \sum_{j \neq j^{'}} E {ε_{i} (t_{i j}) X_{i k} (t_{i j}) B_{k l} (t_{i j}) ε_{i} (t_{i j^{'}}) X_{i k} (t_{i j^{'}}) B_{k l} (t_{i j^{'}})}] . \end{array}

On the other hand,

\begin{array}{l} E {B_{k l} (t_{i j}) B_{k l} (t_{i j^{'}})} = E {B_{k l} (t_{i j})} E {B_{k l} (t_{i j^{'}})} \\ \leq {sup_{t \in [0, T]} f (t) \int_{0}^{T} B_{k l} (t) d t}^{2} \leq C_{1}^{2} L_{k}^{- 2} \end{array}

and $E {B_{k l} {(t)}^{2}} \leq E {B_{k l} (t)} \leq C_{1} L_{k}^{- 1}$ . Therefore,

E (ε_{i}^{T} W_{i} U_{i} U_{i}^{T} W_{i} ε_{i}) \leq C_{2} w_{i}^{- 2} {J_{i} + J_{i} (J_{i} - 1) L_{n}^{- 1}} .

(23)

Combining (22) and (23) and using the Markov inequality, we obtain

ε^{T} WU U^{T} W ε = O_{P} [\sum_{i = 1}^{n} w_{i}^{- 2} {J_{i} + J_{i} (J_{i} - 1) L_{n}^{- 1}}],

which together with (21) yields the desired result.

Lemma 5

Suppose that (17) holds and that λ_n, r_n, ρ_n → 0 and λ_n/ρ_n → ∞ as n → ∞. Then, ||β̂ − β̃||_L₂ =O_P {r_n + (λ_nρ_n)^1/2}.

Proof of Lemma 5

Using properties of B-spline basis functions (see Section A.2 of Huang et al., 2004), we have

∥ {\hat{β}}_{k} - {\tilde{β}}_{k} ∣ ∣_{L_{2}}^{2} = ∥ {\hat{γ}}_{k} - {\tilde{γ}}_{k} ∣ ∣_{k}^{2} ≍ L_{k}^{- 1} ∥ {\hat{γ}}_{k} - {\tilde{γ}}_{k} ∣ ∣_{l_{2}}^{2} .

Sum over k to obtain

∥ \hat{β} - \tilde{β} ∣ ∣_{L_{2}}^{2} = \sum_{k = 1}^{p} ∥ {\hat{γ}}_{k} - {\tilde{γ}}_{k} ∣ ∣_{k}^{2} ≍ L_{n}^{- 1} ∥ \hat{γ} - \tilde{γ} ∣ ∣_{l_{2}}^{2} .

Let $\hat{γ} - \tilde{γ} = δ_{n} L_{n}^{1 / 2} v$ with δ_n a scaler and v a vector satisfying ||v||_l₂= 1. It follows that ||β̂−β̃||_L₂ ≍ δ_n. We first show that δ_n = O_p(r_n + λ_n) and then refine this rate of convergence in the second step of the proof.

Let ε_i = (ε_i(t_i₁), …, ε_i(t_{iJ_i}))^T, i = 1, …, n, and $ε = {(ε_{1}^{T}, \dots, ε_{n}^{T})}^{T}$ . By the definition of γ̃, U^TW(y − ε − U_γ̃) = 0. Since $\hat{γ} = \tilde{γ} + δ_{n} L_{n}^{1 / 2} v$ ,

\begin{array}{l} {(y - U \hat{γ})}^{T} W (y - U \hat{γ}) - {(y - U \tilde{γ})}^{T} W (y - U \tilde{γ}) \\ = - 2 δ_{n} L_{n}^{1 / 2} v^{T} U^{T} W (y - U \tilde{γ}) + δ_{n}^{2} L_{n} v^{T} U^{T} WUv \\ = - 2 δ_{n} L_{n}^{1 / 2} v^{T} U^{T} W ε + δ_{n}^{2} L_{n} v^{T} U^{T} WUv . \end{array}

Hence,

\begin{array}{l} p l (\hat{γ}) - p l (\tilde{γ}) = - \frac{2 δ_{n} L_{n}^{1 / 2}}{n} v^{T} U^{T} W ε + \frac{δ_{n}^{2} L_{n}}{n} v^{T} U^{T} WUv \\ + \sum_{k = 1}^{p} {p_{λ_{n}} (∥ {\hat{γ}}_{k} ∣ ∣_{k}) - p_{λ_{n}} (∥ {\tilde{γ}}_{k} ∣ ∣_{k})}, \end{array}

(24)

where v_k’s are components of v when the latter is partitioned as γ̃.

By the definition of γ̂, pl(γ̂) − pl(γ̃) ≤ 0. According to Lemma 4,

\frac{L_{n}^{1 / 2}}{n} v^{T} U^{T} W ε = \frac{L_{n}^{1 / 2}}{n} \sum_{i = 1}^{n} {ε_{i}}^{T} W_{i} U_{i} v = O_{P} (r_{n}) .

(25)

By Lemma 2,

\frac{L_{n}}{n} v^{T} U^{T} WUv = \frac{L_{n}}{n} \sum_{i = 1}^{n} v^{T} U_{i}^{T} W_{i} U_{i} v \geq M_{3},

(26)

with probability approaching 1. Using the inequality |p_λ(a) − p_λ(b)| ≤ λ|a − b| we obtain

\sum_{k = 1}^{p} {p_{λ_{n}} (∥ {\hat{γ}}_{k} ∣ ∣_{k}) - p_{λ_{n}} (∥ {\tilde{γ}}_{k} ∣ ∣_{k})} \geq \sum_{k = 1}^{p} - λ_{n} ∥ {\hat{γ}}_{k} - {\tilde{γ}}_{k} ∣ ∣_{k} ≍ - λ_{n} δ_{n} .

(27)

Therefore, $0 \geq - O_{p} (r_{n}) δ_{n} + M_{3} δ_{n}^{2} - λ_{n} δ_{n}$ with probability approaching 1, which implies δ_n = O_p(r_n + λ_n).

Now we proceed to improve the obtained rate and show that δ_n = O_P {r_n + (λ_nρ_n)^1/2}. For k = 1, …, p, we have |||γ̂_k||_k − ||γ̃_k||_k| ≤ ||γ̂_k − γ̃_k||_k = o_p(1) and

∣ ∥ {\tilde{γ}}_{k} ∣ ∣_{k} - ∥ β_{k} ∣ ∣_{L_{2}} ∣ \leq ∣ ∥ {\tilde{β}}_{k} ∣ ∣_{L_{2}} - ∥ β_{k} ∣ ∣_{L_{2}} ∣ \leq ∥ {\tilde{β}}_{k} - β_{k} ∣ ∣_{L_{2}} = O_{p} (ρ_{n}) = o_{P} (1) .

(28)

It follows that ||γ̂_k||_k → ||β_k||_L₂ and ||γ̃_k||_k → ||β_k||_L₂with probability. Since ||β_k||_L₂ > 0 for k = 1, …, s and λ_n → 0, we have that, with probability approaching 1, ||γ̂_k||_k > aλ_n, ||γ̃_k||_k > aλ_n, k = 1, …, s, On the other hand, ||β_k||_L₂ = 0 for k = s + 1, …, p, so (28) implies that ||γ̃_k||_k = O_P (ρ_n). Since λ_n/ρ_n → ∞, we have that with probability approaching 1, ||γ̃_k||_k < λ_n, k = s + 1, …, p. Consequently, by the definition of p_λ(·), P {p_{λ_n}(||γ̃_k||H_k) =p_{λ_n}(||γ̂||_{H_k})} → 1, k = 1, … s, and P{p_{λ_n}(||γ̃_k||_{H_k}) = λ_n(||γ̃_k|| _{H_k}} → 1, k = s + 1, …, p.

Therefore,

\sum_{k = 1}^{p} {p_{λ_{n}} (∥ {\hat{γ}}_{k} ∣ ∣_{k}) - p_{λ_{n}} (∥ {\tilde{γ}}_{k} ∣ ∣_{k})} = λ_{n} \sum_{k = s + 1}^{p} ∥ {\tilde{γ}}_{k} ∣ ∣_{k} \geq - O_{P} (λ_{n} ρ_{n}) .

This result combined with (24), (25) and (26) imply that, with probability approaching 1,

p l (\hat{γ}) - p l (\tilde{γ}) \geq - O_{p} (r_{n}) δ_{n} + M_{3} δ_{n}^{2} - O_{p} (λ_{n} ρ_{n}),

which in turn implies δ_n = O_P {r_n + (λ_nρ_n)^1/2}. The proof is complete.

A.3 Proof of Theorem 1

To prove part (a) of Theorem 1, we use proof by contradiction. Suppose that for n large enough, there exists a constant η > 0 such that with probability at least η, there exists a k₀ > s such that β̂ _k₀(t) ≠ 0. Then ||γ̂_k₀||_k₀ = ||β̂ _k₀ (·)||_L₂ > 0. Let γ̂^* be a vector constructed by replacing γ̂_k₀ with 0 in γ̂. Then,

p l (\hat{γ}) - p l ({\hat{γ}}^{*}) = \frac{1}{n} (∥ y - U \hat{γ} ∣ ∣^{2} - ∥ y - U {\hat{γ}}^{*} ∣ ∣^{2}) + p_{λ_{n}} (∥ {\hat{γ}}_{k_{0}} ∣ ∣_{k_{0}}) .

(29)

By Lemma 5 and the fact that β_k₀ (t) = 0, ||γ̂_k₀||_k₀ = ||β̂_k₀|| = O_p{r_n + (λ_nρ_n)^1/2}. Since λ_n/max(r_n, ρ_n) → ∞, we have that λ_n > ||γ̂_k₀||_k₀ and thus p_{λ_n} (||γ̂_k₀||_k₀) = λ_n||γ̂_k₀||_k₀ with probability approaching 1. To analyze the first term on the right hand side of (29), note that

\begin{array}{l} ∥ y - U \hat{γ} ∣ ∣^{2} - ∥ y - U {\hat{γ}}^{*} ∣ ∣^{2} \\ \geq - {(y - U {\hat{γ}}^{*})}^{T} WU (\hat{γ} - {\hat{γ}}^{*}) \\ = - 2 {(y - U \tilde{γ})}^{T} WU (\hat{γ} - {\hat{γ}}^{*}) - 2 {(\tilde{γ} - {\hat{γ}}^{*})}^{T} U^{T} WU (\hat{γ} - {\hat{γ}}^{*}) . \end{array}

(30)

By the Cauchy-Schwarz inequality and Lemma 2,

\begin{array}{l} {(\tilde{γ} - {\hat{γ}}^{*})}^{T} (U^{T} WU) (\hat{γ} - {\hat{γ}}^{*}) \\ \leq {{(\tilde{γ} - {\hat{γ}}^{*})}^{T} (U^{T} WU) (\tilde{γ} - {\hat{γ}}^{*})}^{1 / 2} {{(\hat{γ} - {\hat{γ}}^{*})}^{T} (U^{T} WU) (\tilde{γ} - γ^{*})}^{1 / 2} \\ \leq M_{4} \frac{n}{L_{n}} ∥ \tilde{γ} - {\hat{γ}}^{*} ∣ ∣_{l_{2}} ∥ \hat{γ} - {\hat{γ}}^{*} ∣ ∣_{l_{2}} . \end{array}

It follows from the triangle inequality and Lemmas 3 and 5 that,

∥ \tilde{γ} - {\hat{γ}}^{*} ∣ ∣_{l_{2}} \leq ∥ \tilde{γ} - \hat{γ} ∣ ∣_{l_{2}} + ∥ {\tilde{γ}}_{k_{0}} ∣ ∣_{l_{2}} = O_{p} [L_{n}^{1 / 2} {r_{n} + {(λ_{n} ρ_{n})}^{1 / 2} + ρ_{n}}] .

Hence,

{(\tilde{γ} - {\hat{γ}}^{*})}^{T} (U^{T} WU) (\hat{γ} - {\hat{γ}}^{*}) = O_{p} [L_{n}^{1 / 2} {r_{n} + {(λ_{n} ρ_{n})}^{1 / 2} + ρ_{n}}] \frac{n}{L_{n}} ∥ {\hat{γ}}_{k_{0}} ∣ ∣_{l_{2}} .

(31)

Lemma 4 implies that

∣ {(y - U \tilde{γ})}^{T} WU (\hat{γ} - {\hat{γ}}^{*}) ∣ = ∣ ε^{T} WU (\hat{γ} - {\hat{γ}}^{*}) ∣ = O_{P} (\frac{n r_{n}}{L_{n}^{1 / 2}}) ∥ {\hat{γ}}_{k_{0}} ∣ ∣_{l_{2}} .

(32)

Combining (29)–(32), we get

\begin{array}{l} p l (\hat{γ}) - p l ({\hat{γ}}^{*}) \\ \geq - O_{p} (\frac{r_{n}}{L_{n}^{1 / 2}}) ∥ {\hat{γ}}_{k_{0}} ∣ ∣_{l_{2}} - O_{p} (\frac{r_{n} + {(λ_{n} ρ_{n})}^{1 / 2} + ρ_{n}}{L_{n}^{1 / 2}}) ∥ {\hat{γ}}_{k_{0}} ∣ ∣_{l_{2}} + \frac{λ_{n}}{L_{n}^{1 / 2}} ∥ {\hat{γ}}_{k_{0}} ∣ ∣_{l_{2}} . \end{array}

(33)

Note that on the right hand side of (33), the third term dominates both the first term and the second term, since λ_n/max(r_n, ρ_n) → ∞. This contradicts the fact that pl(γ̂) − pl(γ^*) ≤ 0. We have thus proved part (a).

To prove part (b), write β = ((β⁽¹⁾)^T, (β⁽²⁾)^T)^T, where β⁽¹⁾ = (β₁, …, β_s)^T and β⁽²⁾ = (β_s+1, …, β_p)^T, and write γ = ((γ⁽¹⁾)^T, (γ⁽²⁾)^T)^T, where $γ^{(1)} = {(γ_{1}^{T}, \dots, γ_{s}^{T})}^{T}$ and $γ^{(2)} = {(γ_{s + 1}^{T}, \dots, γ_{p}^{T})}^{T}$ . Similarly, write $U_{i} = (U_{i}^{(1)}, U_{i}^{(2)})$ and U = (U⁽¹⁾; U⁽²⁾). Define the oracle version of γ̃,

\begin{array}{l} {\tilde{γ}}_{oracle} = arg {min}_{γ = {(γ^{{(1)}^{T}}, 0^{T})}^{T}} \frac{1}{n} \sum_{i = 1}^{n} w_{i} {({\tilde{y}}_{i} - U_{i} γ)}^{T} ({\tilde{y}}_{i} - U_{i} γ) \\ = (\begin{matrix} {(\sum_{i} U_{i}^{{(1)}^{T}} W_{i} U_{i}^{(1)})}^{- 1} (\sum_{i} U_{i}^{{(1)}^{T}} W_{i} {\tilde{y}}_{i}) \\ 0 \end{matrix}), \end{array}

(34)

which is obtained as if the information of the nonzero components were given; the corresponding vector of coefficient functions is denoted as β̃_oracle. Note that the true vector of coefficient functions is β = ((β ⁽¹⁾)^T, 0^T)^T. By Lemma 3, ||β̃_oracle − β||_L₂ = O_P (ρ_n) = o_P (1). By Lemmas 3 and 5, ||β̂ − β||_L₂ = o_P (1). Thus, with probabiltiy approaching one, ||β̃_k_,_oracle||_L₂ → ||β_k||_L₂ > 0 and ||β̂_k||_L₂ → ||β_k||_L₂ > 0, for k = 1, …, s. On the other hand, by the definition of β̃_oracle, ||β̃_k;oracle||_L₂ = 0 and by part (a) of the theorem, with probability approaching one, ||β̂_k||_L₂ = 0, for k = s + 1, …, p. Consequently,

\sum_{k = 1}^{p} p_{λ_{n}} (∥ {\tilde{β}}_{k, oracle} ∣ ∣_{L_{2}}) = \sum_{k = 1}^{p} p_{λ_{n}} (∥ {\hat{β}}_{k} ∣ ∣_{L_{2}}),

(35)

with probability approaching 1. ¿From part (a) of the theorem, with probability approaching 1, γ̂ = ((γ̂⁽¹⁾)^T, 0^T)^T. Let $\hat{γ} - {\tilde{γ}}_{oracle} = δ_{n} L_{n}^{1 / 2} v$ with v = ((v⁽¹⁾)^T, 0^T)^T and ||v⁽¹⁾||_l₂ = 1. Then, $∥ \hat{β} - {\tilde{β}}_{oracle} ∣ ∣_{L_{2}} ≍ L_{n}^{- 1} ∥ \hat{γ} - {\tilde{γ}}_{oracle} ∣ ∣_{l_{2}} = δ_{n}$ . By (24) and (35), and the argument following (24),

\begin{array}{l} 0 \geq p l (\hat{γ}) - p l ({\tilde{γ}}_{oracle}) = - \frac{2 δ_{n} L_{n}^{1 / 2}}{n} v^{T} {(U^{(1)})}^{T} W ε + \frac{δ_{n}^{2} L_{n}}{n} v^{T} {(U^{(1)})}^{T} W U^{(1)} v \\ \geq - O_{p} (r_{n}) δ_{n} + M_{3} δ_{n}^{2} . \end{array}

Hence, ||β̂ − β̃||_L₂ ≍ δ_n = Op(r_n), which together with ||β_oracle − −β||_L₂ = o_P (ρ_n) implies that ||β̂ − β||_L² = O_p(ρ_n + r_n). The desired result follows.

A.4 Proof of Corollary 1

By Theorem 6.27 in Schumaker (1981), ${inf}_{g \in G_{k}} ∥ β_{k} - g ∣ ∣_{L_{\infty}} = O (L_{k}^{- 2})$ , and thus $ρ_{n} = O (\sum_{k = 1}^{p} L_{k}^{- 2}) = O (L_{n}^{- 2})$ . Moreover, r_n ≍ (L_n/n)^1/2. Hence, the convergence rate is $∥ {\hat{β}}_{k} - β_{k} ∣ ∣_{L_{2}} = O_{P} {{(L_{n} / n)}^{1 / 2} + L_{n}^{- 2}}$ . When L_n ≍ n^1/5, ${(L_{n} / n)}^{1 / 2} + L_{n}^{- 2} ≍ n^{- 2 / 5}$ . The proof is complete.

A.5 Proof of Theorem 2

According to the proof of Lemma 5, with probability approaching one, ||γ̃_k||_k > aλ_n, ||γ̂ _k||_k > aλ_n, and thus p_{λ_n}(||γ̃_k||_k) = p_{λ_n}(||γ̂_k||_k), k = 1, …, s. By Theorem 1, with probability approaching 1, γ̂ = ((γ̂⁽¹⁾)^T, 0^T)^T is a local minimizer of pl(γ). Note that pl(γ⁽¹⁾) is quadratic in γ⁽¹⁾ = (γ₁, …, γ_s)^T when ||γ_k||_k > aλ_n, k = 1, …, s. Therefore, ∂pl(γ)/∂γ⁽¹⁾|_{γ⁽¹⁾=γ̂⁽¹⁾,γ⁽²⁾=0} = 0, which implies

{\hat{γ}}^{(1)} = {(\sum_{i = 1}^{n} {(U_{i}^{(1)})}^{T} W_{i} U_{i}^{(1)})}^{- 1} (\sum_{i = 1}^{n} {(U_{i}^{(1)})}^{T} W_{i} Y_{i}) .

Let

{\tilde{γ}}^{(1)} = {(\sum_{i = 1}^{n} {(U_{i}^{(1)})}^{T} W_{i} U_{i}^{(1)})}^{- 1} (\sum_{i = 1}^{n} {(U_{i}^{(1)})}^{T} W_{i} E (Y_{i} ∣ x_{i})) .

Let Avar^*(γ̂⁽¹⁾) be a modification of Avar(γ̂⁽¹⁾) given in (20) by replacing Ω_{λ_n} in (19) with 0. Applying Theorem V.8 of Petrov (1975) as in the proof of Theorem 4.1 of Huang (2003) and using the arguments in the proof of Lemma A.8 of Huang et al. (2004), we obtain that, for any vector c_n with dimension $\sum_{k = 1}^{s} L_{s}$ and whose components are not all zero,

{c_{n}^{T} {Avar}^{*} ({\hat{γ}}^{(1)}) c_{n}}^{- 1 / 2} c_{n}^{T} ({\hat{γ}}^{(1)} - {\tilde{γ}}^{(1)}) \to N (0, 1) i n distribution .

For any s-vector a_n whose components are not all zero, choosing c_n = (B⁽¹⁾(t))^Ta_n yields

{[a_{n}^{T} {Avar}^{*} {{\hat{β}}^{(1)} (t)} a_{n}]}^{- 1 / 2} a_{n}^{T} {{\hat{β}}^{(1)} (t) - {\tilde{β}}^{(1)} (t)} \to N (0, 1) i n distribution,

which in turn yields the desired result.

Contributor Information

Lifeng Wang, Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104, lifwang@mail.med.upenn.edu.

Hongzhe Li, Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104, hongzhe@mail.med.upenn.edu.

Jianhua Z. Huang, Department of Statistics, Texas A&M University, College Station, TX 77843-3143, jian-hua@stat.tamu.edu.

References

Banerjee N, Zhang MQ. Identifying cooperativity among transcription factors controlling the cell cycle in yeast. Nucleic Acids Research. 2003;31:7024–7031. doi: 10.1093/nar/gkg894. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bickel P, Li B. Regularization in Statistics (with discussion) Test. 2006;15:271–344. [Google Scholar]
Chen G, Jensen S, Stoeckert C. Clustering of genes into regulons using integrated modeling (CORIM) Genome Biology. 2007;8(1):R4. doi: 10.1186/gb-2007-8-1-r4. [DOI] [PMC free article] [PubMed] [Google Scholar]
Conlon EM, Liu XS, Lieb JD, Liu JS. Integrating regulatory motif discovery and genome-wide expression analysis; Proceedings of National Academy of Sciences; 2003. pp. 3339–3344. [DOI] [PMC free article] [PubMed] [Google Scholar]
Diggle PJ, Liang KY, Zeger SL. Analysis of longitudinal data. Oxford: Oxford University Press; 1994. [Google Scholar]
Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Annals of Statistics. 2004;32:407499. [Google Scholar]
Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of American Statistical Association. 2001;96:1348–1360. [Google Scholar]
Fan J, Li R. New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. Journal of American Statistical Association. 2004;99:710–723. [Google Scholar]
Fan J, Peng H. On non-concave penalized likelihood with diverging number of parameters. The Annals of Statistics. 2004;32:928–961. [Google Scholar]
Hastie T, Tibshirani R. Varying-coefficient models. Journal of Royal Statistical Society, Ser B. 1993;55:757–796. [Google Scholar]
Hoover DR, Rice JA, Wu CO, Yang L. Nonparametric smoothing estimates of time-varying coefficient models with longitudinal data. Biometrika. 1998;85:809–822. [Google Scholar]
Huang J, Horowitz JL, Ma S. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics. 2008;36:587–613. [Google Scholar]
Huang JZ, Wu CO, Zhou L. Varying-coefficient models and basis function approximation for the analysis of repeated measurements. Biometrika. 2002;89:111–128. [Google Scholar]
Huang JZ. Local asymptotics for polynomial spline regression. The Annals of Statistics. 2003;31:1600–1635. [Google Scholar]
Huang JZ, Wu CO, Zhou L. Polynomial spline estimation and inference for varying coefficient models with longitudinal data. Statistica Sinica. 2004;14:763–788. [Google Scholar]
Huang JZ, Liu N, Pourahmadi M, Liu L. Covariance selection and estimation via penalised normal likelihood. Biometrika. 2006;93:85–98. [Google Scholar]
Huang JZ, Liu L, Liu N. Estimation of large covariance matrices of longitudinal data with basis function approximations. J Comput Graph statist. 2007;16:189–209. [Google Scholar]
Huang JZ, Zhang L, Zhou L. Efficient estimation in marginal partially linear models for longitudinal/clustered data using splines. Scandinavian Journal of Statistics. 2007;34:451–477. [Google Scholar]
Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, et al. Transcriptional regulatory networks in S. cerevisiae. Science. 2002;298:799–804. doi: 10.1126/science.1075090. [DOI] [PubMed] [Google Scholar]
Leng C, Zhang HH. Nonparametric model selection in hazard regression. Journal of Nonparametric Statistics. 2007;18:417–429. [Google Scholar]
Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22. [Google Scholar]
Lin X, Carroll RJ. Nonparametric function estimation for clustered data when the predictor is measured without/with error. Journal of American Statistical Association. 2000;95:520–534. [Google Scholar]
Lin Y, Zhang HH. Component selection and smoothing in smoothing spline analysis of variance models – COSSO. Annals of Statistics. 2006;34:2272–2297. [Google Scholar]
Luan Y, Li H. Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics. 2003;19:474–482. doi: 10.1093/bioinformatics/btg014. [DOI] [PubMed] [Google Scholar]
Petrov V. Sums of Independence Random Variables. Springer-Verlag; New York: 1975. [Google Scholar]
Rice JA. Functional and longitudinal data analysis: perspectives on smoothing. Statistica Sinica. 2004;14:631–647. [Google Scholar]
Rice JA, Silverman BW. Estimating the mean and covariance structure non-parametrically when the data are curves. Journal of Royal Statistical Society B. 1991;53:233–243. [Google Scholar]
Rice JA, Wu CO. Nonparametric mixed effects models for unequally sampled noisy curves. Biometrics. 2001;57:253–259. doi: 10.1111/j.0006-341x.2001.00253.x. [DOI] [PubMed] [Google Scholar]
Schumaker LL. Spline Functions: Basic Theory. Wiley; New York: 1981. [Google Scholar]
Simon I, Barnett J, Hannett N, Harbison CT, Rinaldi NJ, Volkert TL, Wyrick JJ, Zeitlinger J, Gifford DK, Jaakola TS, Young RA. Serial regulation of transcriptional regulators in the yeast cell cycle. Cell. 2001;106:697–708. doi: 10.1016/s0092-8674(01)00494-9. [DOI] [PubMed] [Google Scholar]
Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B. Comprehensive identification of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. Molecular Biology of Cell. 1998;9:3273–3297. doi: 10.1091/mbc.9.12.3273. [DOI] [PMC free article] [PubMed] [Google Scholar]
Stone CJ. Optimal global rates of convergence for nonparametric regression. Annals of Statistics. 1982;10:1348–1360. [Google Scholar]
Tibshirani RJ. Regression shrinkage and selection via the lasso. Journal of Royal Statistical Society B. 1996;58:267–288. [Google Scholar]
Tsai HK, Lu SHH, Li WH. Statistical methods for identifying yeast cell cycle transcription factors. Proceedings of National Academy of Sciences. 2005;102(38):13532–13537. doi: 10.1073/pnas.0505874102. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wang L, Chen G, Li H. Group SCAD regression analysis for microarray time course gene expression data. Bioinformatics. 2007;23:1486–1494. doi: 10.1093/bioinformatics/btm125. [DOI] [PubMed] [Google Scholar]
Wu WB, Pourahmadi M. Nonparametric estimation of large covariance matrices of longitudinal data. Biometrika. 2003;90:831–44. [Google Scholar]
Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of Royal Statistical Society B. 2006;68:49–67. [Google Scholar]
Zeger SL, Diggle PJ. Semiparametric models for longitudinal data with application to CD4 cell numbers in HIV seroconverters. Biometrics. 1994;50:689–699. [PubMed] [Google Scholar]
Zhang H. Variable selection for support vector machines via smoothing spline ANOVA. Statistica Sinica. 2006;16:659–674. [Google Scholar]
Zhang H, Lin Y. Component Selection and Smoothing for Nonparametric Regression in Exponential Families. Statistica Sinica. 2006;16:1021–1042. [Google Scholar]
Zou H. The adaptive Lasso and its Oracle properties. Journal of the American Statistical Association. 2006;101(476):1418–1429. [Google Scholar]
Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of Royal Statistical Society B. 2005;67:301–320. [Google Scholar]

[R1] Banerjee N, Zhang MQ. Identifying cooperativity among transcription factors controlling the cell cycle in yeast. Nucleic Acids Research. 2003;31:7024–7031. doi: 10.1093/nar/gkg894. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R2] Bickel P, Li B. Regularization in Statistics (with discussion) Test. 2006;15:271–344. [Google Scholar]

[R3] Chen G, Jensen S, Stoeckert C. Clustering of genes into regulons using integrated modeling (CORIM) Genome Biology. 2007;8(1):R4. doi: 10.1186/gb-2007-8-1-r4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] Conlon EM, Liu XS, Lieb JD, Liu JS. Integrating regulatory motif discovery and genome-wide expression analysis; Proceedings of National Academy of Sciences; 2003. pp. 3339–3344. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] Diggle PJ, Liang KY, Zeger SL. Analysis of longitudinal data. Oxford: Oxford University Press; 1994. [Google Scholar]

[R6] Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Annals of Statistics. 2004;32:407499. [Google Scholar]

[R7] Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of American Statistical Association. 2001;96:1348–1360. [Google Scholar]

[R8] Fan J, Li R. New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. Journal of American Statistical Association. 2004;99:710–723. [Google Scholar]

[R9] Fan J, Peng H. On non-concave penalized likelihood with diverging number of parameters. The Annals of Statistics. 2004;32:928–961. [Google Scholar]

[R10] Hastie T, Tibshirani R. Varying-coefficient models. Journal of Royal Statistical Society, Ser B. 1993;55:757–796. [Google Scholar]

[R11] Hoover DR, Rice JA, Wu CO, Yang L. Nonparametric smoothing estimates of time-varying coefficient models with longitudinal data. Biometrika. 1998;85:809–822. [Google Scholar]

[R12] Huang J, Horowitz JL, Ma S. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics. 2008;36:587–613. [Google Scholar]

[R13] Huang JZ, Wu CO, Zhou L. Varying-coefficient models and basis function approximation for the analysis of repeated measurements. Biometrika. 2002;89:111–128. [Google Scholar]

[R14] Huang JZ. Local asymptotics for polynomial spline regression. The Annals of Statistics. 2003;31:1600–1635. [Google Scholar]

[R15] Huang JZ, Wu CO, Zhou L. Polynomial spline estimation and inference for varying coefficient models with longitudinal data. Statistica Sinica. 2004;14:763–788. [Google Scholar]

[R16] Huang JZ, Liu N, Pourahmadi M, Liu L. Covariance selection and estimation via penalised normal likelihood. Biometrika. 2006;93:85–98. [Google Scholar]

[R17] Huang JZ, Liu L, Liu N. Estimation of large covariance matrices of longitudinal data with basis function approximations. J Comput Graph statist. 2007;16:189–209. [Google Scholar]

[R18] Huang JZ, Zhang L, Zhou L. Efficient estimation in marginal partially linear models for longitudinal/clustered data using splines. Scandinavian Journal of Statistics. 2007;34:451–477. [Google Scholar]

[R19] Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, et al. Transcriptional regulatory networks in S. cerevisiae. Science. 2002;298:799–804. doi: 10.1126/science.1075090. [DOI] [PubMed] [Google Scholar]

[R20] Leng C, Zhang HH. Nonparametric model selection in hazard regression. Journal of Nonparametric Statistics. 2007;18:417–429. [Google Scholar]

[R21] Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22. [Google Scholar]

[R22] Lin X, Carroll RJ. Nonparametric function estimation for clustered data when the predictor is measured without/with error. Journal of American Statistical Association. 2000;95:520–534. [Google Scholar]

[R23] Lin Y, Zhang HH. Component selection and smoothing in smoothing spline analysis of variance models – COSSO. Annals of Statistics. 2006;34:2272–2297. [Google Scholar]

[R24] Luan Y, Li H. Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics. 2003;19:474–482. doi: 10.1093/bioinformatics/btg014. [DOI] [PubMed] [Google Scholar]

[R25] Petrov V. Sums of Independence Random Variables. Springer-Verlag; New York: 1975. [Google Scholar]

[R26] Rice JA. Functional and longitudinal data analysis: perspectives on smoothing. Statistica Sinica. 2004;14:631–647. [Google Scholar]

[R27] Rice JA, Silverman BW. Estimating the mean and covariance structure non-parametrically when the data are curves. Journal of Royal Statistical Society B. 1991;53:233–243. [Google Scholar]

[R28] Rice JA, Wu CO. Nonparametric mixed effects models for unequally sampled noisy curves. Biometrics. 2001;57:253–259. doi: 10.1111/j.0006-341x.2001.00253.x. [DOI] [PubMed] [Google Scholar]

[R29] Schumaker LL. Spline Functions: Basic Theory. Wiley; New York: 1981. [Google Scholar]

[R30] Simon I, Barnett J, Hannett N, Harbison CT, Rinaldi NJ, Volkert TL, Wyrick JJ, Zeitlinger J, Gifford DK, Jaakola TS, Young RA. Serial regulation of transcriptional regulators in the yeast cell cycle. Cell. 2001;106:697–708. doi: 10.1016/s0092-8674(01)00494-9. [DOI] [PubMed] [Google Scholar]

[R31] Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B. Comprehensive identification of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. Molecular Biology of Cell. 1998;9:3273–3297. doi: 10.1091/mbc.9.12.3273. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R32] Stone CJ. Optimal global rates of convergence for nonparametric regression. Annals of Statistics. 1982;10:1348–1360. [Google Scholar]

[R33] Tibshirani RJ. Regression shrinkage and selection via the lasso. Journal of Royal Statistical Society B. 1996;58:267–288. [Google Scholar]

[R34] Tsai HK, Lu SHH, Li WH. Statistical methods for identifying yeast cell cycle transcription factors. Proceedings of National Academy of Sciences. 2005;102(38):13532–13537. doi: 10.1073/pnas.0505874102. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R35] Wang L, Chen G, Li H. Group SCAD regression analysis for microarray time course gene expression data. Bioinformatics. 2007;23:1486–1494. doi: 10.1093/bioinformatics/btm125. [DOI] [PubMed] [Google Scholar]

[R36] Wu WB, Pourahmadi M. Nonparametric estimation of large covariance matrices of longitudinal data. Biometrika. 2003;90:831–44. [Google Scholar]

[R37] Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of Royal Statistical Society B. 2006;68:49–67. [Google Scholar]

[R38] Zeger SL, Diggle PJ. Semiparametric models for longitudinal data with application to CD4 cell numbers in HIV seroconverters. Biometrics. 1994;50:689–699. [PubMed] [Google Scholar]

[R39] Zhang H. Variable selection for support vector machines via smoothing spline ANOVA. Statistica Sinica. 2006;16:659–674. [Google Scholar]

[R40] Zhang H, Lin Y. Component Selection and Smoothing for Nonparametric Regression in Exponential Families. Statistica Sinica. 2006;16:1021–1042. [Google Scholar]

[R41] Zou H. The adaptive Lasso and its Oracle properties. Journal of the American Statistical Association. 2006;101(476):1418–1429. [Google Scholar]

[R42] Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of Royal Statistical Society B. 2005;67:301–320. [Google Scholar]

PERMALINK

Variable Selection in Nonparametric Varying-Coefficient Models for Analysis of Repeated Measurements

Lifeng Wang

Hongzhe Li

Jianhua Z Huang

Roles

SUMMARY

1 Introduction

2 Basis Function Expansion and Regularized Estimation Using the SCAD Penalty

Remark

3 Computational Algorithm

Remark

4 Selection of Tuning Parameters

Lemma 1

Figure 1.

5 Large Sample Properties

Theorem 1

Corollary 1

Theorem 2

6 Monte Carlo Simulation

Figure 2.

Table 1.

7 Applications to Real Datasets

7.1 Application to AIDS data

Figure 3.

7.2 Application to yeast cell cycle gene expression data

Figure 4.

8 Discussion

Acknowledgments

Appendix

A.1 Proof of Lemma 1

A.2 Useful lemmas

Lemma 2

Lemma 3

Lemma 4

Proof of Lemma 4

Lemma 5

Proof of Lemma 5

A.3 Proof of Theorem 1

A.4 Proof of Corollary 1

A.5 Proof of Theorem 2

Contributor Information

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases