Author manuscript; available in PMC: 2015 Jan 1.
Published in final edited form as: J Am Stat Assoc. 2014 Mar 19;109(505):266–274. doi: 10.1080/01621459.2013.850086

Feature Selection for Varying Coefficient Models With Ultrahigh Dimensional Covariates

Jingyuan Liu 1, Runze Li 2, Rongling Wu 3
PMCID: PMC3963210  NIHMSID: NIHMS561310  PMID: 24678135

Abstract

This paper is concerned with feature screening and variable selection for varying coefficient models with ultrahigh dimensional covariates. We propose a new feature screening procedure for these models based on the conditional correlation coefficient. We systematically study the theoretical properties of the proposed procedure and establish its sure screening property and ranking consistency. To enhance the finite sample performance of the proposed procedure, we further develop an iterative feature screening procedure. Monte Carlo simulation studies were conducted to examine the performance of the proposed procedures. In practice, we advocate a two-stage approach for varying coefficient models. The two-stage approach consists of (a) reducing the ultrahigh dimensionality by using the proposed procedure and (b) applying regularization methods to the dimension-reduced varying coefficient models to make statistical inferences on the coefficient functions. We illustrate the proposed two-stage approach by a real data example.

Keywords: Feature selection, varying coefficient models, ranking consistency, sure screening property

1. Introduction

Varying coefficient models with ultrahigh dimensional covariates (ultrahigh dimensional varying coefficient models for short) can be very useful for analyzing genetic study data to examine varying gene effects. This study was motivated by an empirical analysis of a subset of Framingham Heart Study (FHS) data; see Section 3.2 for more details. Of interest in this empirical analysis is to identify genes strongly associated with body mass index (BMI). Some initial exploratory analysis of this data subset indicates that the effects of genes on BMI are age-dependent. Thus, it is natural to apply the varying coefficient model in this analysis. There are thousands of single-nucleotide polymorphisms available in the FHS database, leading to ultrahigh dimensionality, while only hundreds of samples are available, as is typical of genetic study data. Thus, feature screening and variable selection become indispensable for estimation of ultrahigh dimensional varying coefficient models.

Some variable selection methods have been developed for varying coefficient models with low dimensional covariates in the literature. Li and Liang (2008) proposed a generalized likelihood ratio test to select significant covariates with varying effects. Wang, Li and Huang (2008) developed a regularized estimation procedure based on basis function approximations and the SCAD penalty (Fan and Li, 2001) to simultaneously select significant variables and estimate the nonzero smooth coefficient functions. Wang and Xia (2009) proposed a shrinkage method integrating local polynomial regression techniques (Fan and Gijbels, 1996) and the LASSO (Tibshirani, 1996). Nevertheless, these variable selection procedures were developed for varying coefficient models with fixed dimensional covariates. As a result, they cannot be directly applied to ultrahigh dimensional varying coefficient models.

To deal with ultrahigh dimensionality, one appealing method is a two-stage approach: first, a computationally efficient screening procedure is applied to reduce the ultrahigh dimensionality to a moderate scale below the sample size, and then the final sparse model is recovered from the screened submodel by a regularization method. Several screening techniques for the first stage have been developed for various models. Fan and Lv (2008) showed that sure independence screening (SIS) possesses the sure screening property in the linear model setting. Hall and Miller (2009) extended the methodology from linear models to nonlinear models using generalized empirical correlation learning, but it is not trivial to choose an optimal transformation function. Fan and Song (2010) modified SIS for generalized linear models by ranking the maximum marginal likelihood estimates. Fan, Feng and Song (2011) explored a feature screening technique for ultrahigh dimensional additive models by ranking the magnitude of spline approximations of the nonparametric components. Zhu, Li, Li and Zhu (2011) proposed a sure independence ranking and screening procedure to select important predictors under the multi-index model setting. Li, Peng, Zhang and Zhu (2012) proposed rank correlation feature screening for a class of semiparametric models, such as transformation regression models and single-index models with a monotonicity constraint on the link function, without involving nonparametric estimation even when there are nonparametric functions in the models. Model-free screening procedures have also been advocated in the literature. Li, Zhong and Zhu (2012) developed a model-free feature screening procedure based on the distance correlation, which is directly applicable to multiple responses and grouped predictors. He, Wang and Hong (2013) proposed a quantile-adaptive model-free feature screening procedure for heterogeneous data. This paper aims to develop a kernel-regression-based screening method specifically for ultrahigh dimensional varying coefficient models to reduce dimensionality.

The coefficients in a varying coefficient model are functions of a covariate u; thus, conditioning on u, the varying coefficient model is a linear model. It is therefore natural to employ the conditional Pearson correlation coefficient as a measure of the strength of association between a predictor and the response. In this paper, we propose using kernel regression techniques to estimate the conditional correlation coefficients, and we further develop a marginal utility for feature screening based on the kernel regression estimate. We investigate the finite sample performance of the proposed procedure via Monte Carlo simulation studies and illustrate the proposed methodology by an empirical analysis of a subset of FHS data. This paper makes the following theoretical contributions to the literature. We first establish a concentration inequality for the kernel regression estimate of the conditional Pearson correlation coefficient. Based on this concentration inequality, we further establish several desirable theoretical properties of the proposed procedure. We show that the proposed procedure possesses the ranking consistency property (Zhu, et al., 2011): with probability tending to 1, the important predictors rank before the unimportant ones. We also show that the proposed procedure enjoys the sure screening property (Fan and Lv, 2008) under the setting of ultrahigh dimensional varying coefficient models. The sure screening property guarantees that the probability that the model chosen by our screening procedure includes the true model tends to 1 at an exponential rate in the sample size.

The rest of the paper is organized as follows. In Section 2, we propose a new feature screening procedure for ultrahigh dimensional varying coefficient models and study its theoretical properties. In Section 3, Monte Carlo simulations are conducted to assess the finite sample performance of the proposed procedure. In addition, we propose a two-stage approach for ultrahigh dimensional varying coefficient models, and illustrate the approach by examining the age-specific SNP effects on BMI using the FHS data. We also propose an iterative screening procedure in Section 3. Concluding remarks are given in Section 4, and the technical proofs are given in the online supplement.

2. A New Feature Screening Procedure

Let y be the response, and x = (x1, …, xp)T ∈ ℝp be the p-dimensional predictor. Consider a varying coefficient model

y = \beta_0(u) + x^T\beta(u) + \varepsilon, \qquad (2.1)

where E(ε|x, u) = 0, β0(u) is the intercept function and β(u) = (β1(u), …, βp(u))T consists of p unknown smooth functions βj(u), j = 1, …, p, of univariate variable u.

Note that given u, the varying coefficient model becomes a linear regression model. Fan and Lv (2008) proposed a sure independence screening procedure for linear regression models based on the Pearson correlation coefficient. Thus, it is natural to consider the conditional Pearson correlation coefficient for feature screening. Specifically, given u, the conditional correlation between the response y and each predictor x_j, j = 1, …, p, is defined as

\rho(x_j, y \mid u) = \frac{\mathrm{cov}(x_j, y \mid u)}{\sqrt{\mathrm{cov}(x_j, x_j \mid u)\,\mathrm{cov}(y, y \mid u)}}, \qquad (2.2)

which is a function of u. Define the marginal utility for feature screening as

\rho_{j0}^* = E\{\rho^2(x_j, y \mid u)\}.

To estimate ρ*_{j0}, let us proceed with estimation of ρ(x_j, y|u), which essentially requires estimation of five conditional means: E(y|u), E(y²|u), E(x_j|u), E(x_j²|u) and E(x_j y|u). Throughout this paper, it is assumed that these five conditional means are nonparametric smooth functions of u. Therefore, the conditional correlation in (2.2) can be estimated through nonparametric mean estimation techniques. We will use the kernel smoothing method (Fan and Gijbels, 1996) to estimate these conditional means.

Suppose {(u_i, x_i, y_i), i = 1, …, n} is a random sample from (2.1). Let K(t) be a kernel function, and h be a bandwidth. Then the kernel regression estimate of E(y|u) is

\hat{E}(y \mid u) = \frac{\sum_{i=1}^{n} K_h(u_i - u)\, y_i}{\sum_{i=1}^{n} K_h(u_i - u)}, \qquad (2.3)

where K_h(t) = h^{-1}K(t/h). Similarly, we may define kernel regression estimates Ê(y²|u), Ê(x_j|u), Ê(x_j²|u) and Ê(x_j y|u) for E(y²|u), E(x_j|u), E(x_j²|u) and E(x_j y|u), respectively. The conditional covariance cov(x_j, y|u) can be estimated by ĉov(x_j, y|u) = Ê(x_j y|u) − Ê(x_j|u)Ê(y|u), and the conditional correlation is naturally estimated by

\hat\rho(x_j, y \mid u) = \frac{\widehat{\mathrm{cov}}(x_j, y \mid u)}{\sqrt{\widehat{\mathrm{cov}}(x_j, x_j \mid u)\,\widehat{\mathrm{cov}}(y, y \mid u)}}. \qquad (2.4)

Remark

We employ kernel regression rather than local linear regression because local linear regression estimates cannot guarantee ĉov(y, y|u) ≥ 0 and ĉov(x_j, x_j|u) ≥ 0. Furthermore, the bandwidth h is required to be the same for all five conditional means in order to guarantee that |ρ̂(x_j, y|u)| ≤ 1. In our numerical studies, we first select an optimal bandwidth for E(x_j y|u) by using a plug-in method (Ruppert, Sheather and Wand, 1995), and then use this bandwidth for the other four conditional means. We empirically studied the impact of bandwidth selection on the performance of the proposed screening procedure in Section 3.1. In our simulation study, the proposed procedure performs quite well provided that the bandwidth lies within an appropriate range.

The plug-in estimate of ρ*_{j0} is

\hat\rho_j^* = \frac{1}{n}\sum_{i=1}^{n} \hat\rho^2(x_j, y \mid u_i). \qquad (2.5)

Based on ρ̂*_j, we propose a screening procedure for ultrahigh dimensional varying coefficient models as follows: sort the ρ̂*_j, j = 1, …, p, in decreasing order, and define the screened submodel as

\hat{\mathcal{M}} = \{\, j : 1 \le j \le p,\ \hat\rho_j^* \text{ ranks among the first } d \,\}, \qquad (2.6)

where the submodel size d is taken to be smaller than the sample size n. Thus, the ultrahigh dimensionality p is reduced to the moderate scale d. Fan and Lv (2008) suggested setting d = [n/log(n)], where [a] denotes the integer part of a. In the kernel regression setting, it is known that the effective sample size is nh rather than n, and the optimal rate of the bandwidth is h = O(n^{-1/5}) (Fan and Gijbels, 1996). Thus we may set d = [n^{4/5}/log(n^{4/5})] for ultrahigh dimensional varying coefficient models. We will examine the impact of the choice of d in our simulation by considering d = ν[n^{4/5}/log(n^{4/5})] with different values of ν. The proposed procedure is referred to as conditional correlation sure independence screening (CC-SIS for short).
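To make the computations in (2.3)-(2.6) concrete, the following sketch implements CC-SIS with NumPy. It uses the Epanechnikov kernel adopted in our numerical studies (Section 3) and a simple rule-of-thumb bandwidth as a stand-in for the plug-in selector of Ruppert, Sheather and Wand (1995); the function names and the small guard against zero variances are our own choices, not part of the procedure itself.

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel K(t) = 0.75 (1 - t^2)_+, the kernel used in Section 3."""
    return 0.75 * np.maximum(1.0 - t ** 2, 0.0)

def cond_corr_sq(x, y, u, h):
    """Marginal utility rho*_j of (2.5): average over u_i of the squared
    conditional correlation (2.4), with conditional means estimated as in (2.3)."""
    n = len(u)
    rho_sq = np.empty(n)
    for i in range(n):
        w = epanechnikov((u - u[i]) / h)           # kernel weights; the common 1/h factor cancels
        w = w / w.sum()                             # Nadaraya-Watson weights of (2.3)
        ex, ey = w @ x, w @ y                       # E-hat(x_j | u_i), E-hat(y | u_i)
        cxy = w @ (x * y) - ex * ey                 # cov-hat(x_j, y | u_i)
        cxx = w @ (x * x) - ex ** 2                 # cov-hat(x_j, x_j | u_i)
        cyy = w @ (y * y) - ey ** 2                 # cov-hat(y, y | u_i)
        rho_sq[i] = cxy ** 2 / max(cxx * cyy, 1e-12)  # squared (2.4), guarded against 0
    return rho_sq.mean()

def cc_sis(X, y, u, d=None, h=None):
    """CC-SIS: rank predictors by rho*_j and keep the top d, as in (2.6)."""
    n, p = X.shape
    if h is None:
        h = np.std(u) * n ** (-0.2)                 # rule-of-thumb stand-in for the plug-in bandwidth
    if d is None:
        d = int(np.floor(n ** 0.8 / np.log(n ** 0.8)))
    scores = np.array([cond_corr_sq(X[:, j], y, u, h) for j in range(p)])
    selected = np.argsort(scores)[::-1][:d]         # indices with the d largest rho*_j
    return selected, scores
```

For the simulation settings of Section 3 (n = 200), cc_sis with the default d returns the [n^{4/5}/log(n^{4/5})] = 16 top-ranked predictors.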

We next study the theoretical properties of the newly proposed screening procedure CC-SIS. Let us first introduce some notation. The support of u is assumed to be bounded and is denoted by 𝒰 = [a, b] with finite constants a and b. Define the true model index set M* with cardinality p* and its complement M^c by

\mathcal{M}^* = \{\,1 \le j \le p : \beta_j(u) \ne 0 \text{ for some } u \in \mathcal{U}\,\}, \qquad \mathcal{M}^c = \{\,1 \le j \le p : \beta_j(u) = 0 \text{ for all } u \in \mathcal{U}\,\}.

Denote the truly important predictor vector by x_{M*}, the vector consisting of the x_j with j ∈ M*. That is, if M* = {j_1, …, j_{p*}}, then x_{M*} = (x_{j_1}, …, x_{j_{p*}})^T. Similarly, define x_{M^c}, β_{M*}(u) and β_{M^c}(u). Furthermore, define ρ_{M*}(u) to be the vector consisting of ρ(x_j, y|u) with j ∈ M*. Denote by λ_max{A} and λ_min{A} the largest and smallest eigenvalues of a matrix A, respectively, and "a_n > b_n uniformly in n" means "lim inf_{n→∞}{a_n − b_n} > 0". Furthermore, denote z^{⊗2} = zz^T for notational simplicity.

The following two conditions are needed for Theorem 1, which characterizes the relationship of ρ*_{j0} between the truly important and unimportant predictors.

  • (B1)
    The following inequality holds uniformly in n:
    \min_{j \in \mathcal{M}^*} \rho_{j0}^* > E\left\{ \frac{\lambda_{\max}\{\mathrm{cov}(x_{\mathcal{M}^*}, x_{\mathcal{M}^c}^T \mid u)\,\mathrm{cov}(x_{\mathcal{M}^c}, x_{\mathcal{M}^*}^T \mid u)\}\cdot\lambda_{\max}\{\rho_{\mathcal{M}^*}^{\otimes 2}(u)\}}{\lambda_{\min}^2\{\mathrm{cov}(x_{\mathcal{M}^*} \mid u)\}} \right\}. \qquad (2.7)
  • (B2)
    Assume that, conditioning on x_{M*}^T β_{M*}(u) and u, x and ε are independent. Further assume that the following linearity condition holds:
    E\{x \mid x_{\mathcal{M}^*}^T\beta_{\mathcal{M}^*}(u), u\} = \mathrm{cov}(x, x_{\mathcal{M}^*}^T \mid u)\,\beta_{\mathcal{M}^*}(u)\,\{\mathrm{cov}(x_{\mathcal{M}^*}^T\beta_{\mathcal{M}^*}(u) \mid u)\}^{-1}\,\beta_{\mathcal{M}^*}^T(u)\,x_{\mathcal{M}^*}. \qquad (2.8)

    Conditions (B1) and (B2) are adapted from Zhu, et al. (2011). Condition (B1) requires that the population-level unconditional squared correlations ρ*_{j0} of the truly important predictors not be too small. The first assumption in (B2) implies that y depends on x only through x_{M*}^T β_{M*}(u), and (2.8) is a conditional linearity condition.

Theorem 1

Under conditions (B1) and (B2), it follows that

\liminf_{n \to \infty}\Big\{ \min_{j \in \mathcal{M}^*} \rho_{j0}^* - \max_{j \in \mathcal{M}^c} \rho_{j0}^* \Big\} > 0. \qquad (2.9)

The proof of Theorem 1 is similar to that of Theorem 1 of Zhu, et al. (2011) and is therefore given in the supplementary material of this paper.

The inequality (2.9) provides a clear separation between the important and unimportant predictors in terms of ρ*_{j0}. It rules out the situation in which certain unimportant predictors have large ρ*_{j0}'s and are selected only because they are highly correlated with the true ones. This is a necessary condition for the ranking consistency property established below.

The following regularity conditions are used to establish the ranking consistency property and sure screening property of the CC-SIS.

  • (C1)

    Denote the density function of u by f(u). Assume that f(u) has a continuous second order derivative on 𝒰.

  • (C2)

    The kernel K(·) is a symmetric density function with finite support and is bounded uniformly over its support.

  • (C3)
    The random variables xj and y satisfy the sub-exponential tail probability uniformly in p. That is, there exists s0 > 0, such that for 0 ≤ s < s0,
    \sup_{u \in \mathcal{U}} \max_{1 \le j \le p} E\{\exp(s x_j^2) \mid u\} < \infty, \qquad \sup_{u \in \mathcal{U}} E\{\exp(s y^2) \mid u\} < \infty, \qquad \sup_{u \in \mathcal{U}} \max_{1 \le j \le p} E\{\exp(s |x_j y|) \mid u\} < \infty.
  • (C4)
    All conditional means E(y|u), E(y²|u), E(x_j|u), E(x_j²|u) and E(x_j y|u), as well as their first and second order derivatives, are finite uniformly in u ∈ 𝒰. Further assume that
    \inf_{u \in \mathcal{U}} \min_{1 \le j \le p} \mathrm{var}(x_j \mid u) > 0, \qquad \inf_{u \in \mathcal{U}} \mathrm{var}(y \mid u) > 0.

    Conditions (C1) and (C2) are mild conditions on the density function f(u) and the kernel function K(·), and are satisfied by most commonly used distributions and kernels. Moreover, (C2) implies that K(·) has every finite moment, i.e., E(|K(u)|^r) < ∞ for any r > 0. Although the Gaussian kernel does not satisfy (C2), it can be shown that the results in Theorems 2 and 3 remain valid for the Gaussian kernel. Condition (C3) is relatively strong and is used only to facilitate the technical proofs. Condition (C4) requires the mean-related quantities to be bounded and the variances to be positive, in order to guarantee that the conditional correlation is well defined. We first establish the ranking consistency property of CC-SIS.

Theorem 2

(Ranking Consistency Property) Under conditions (B1), (B2), (C1)-(C4), suppose the bandwidth h → 0 but nh3 → ∞ as n → ∞. Then for p = o{exp(an)} with some a > 0, we have

\liminf_{n \to \infty}\Big\{ \min_{j \in \mathcal{M}^*} \hat\rho_j^* - \max_{j \in \mathcal{M}^c} \hat\rho_j^* \Big\} > 0 \quad \text{in probability}.

The proof of Theorem 2 is given in the online supplement of this paper. Theorem 2 states that, with an overwhelming probability, the truly important predictors have larger ρ̂*_j's than the unimportant ones, and hence all the true predictors are ranked at the top by the proposed screening procedure. We next develop the sure screening property of CC-SIS.

Theorem 3

(Sure Screening Property) Under conditions (C1)-(C4), suppose the bandwidth satisfies h = O(n^{-γ}) with 0 < γ < 1/3. Then, for any c_3 > 0 and 0 ≤ κ < γ, we have

P\Big( \max_{1 \le j \le p} \big|\hat\rho_j^* - \rho_{j0}^*\big| > c_3 n^{-\kappa} \Big) \le O\big\{ n\,p\,\exp\big(-n^{1/5-\kappa}/\xi\big) \big\},

If we further assume that

\min_{j \in \mathcal{M}^*} \rho_{j0}^* \ge 2 c_3 n^{-\kappa}, \qquad (2.10)

then

P\big( \mathcal{M}^* \subseteq \hat{\mathcal{M}} \big) \ge 1 - O\big\{ n\, s_n \exp\big(-n^{\gamma-\kappa}/\xi\big) \big\},

where ξ is some positive constant determined by c_3, and s_n is the cardinality of M*, which may vary with n under the sparsity assumption.

The proof of Theorem 3 is given in the online supplement of this paper. Condition (2.10) guarantees that the unconditional squared correlations ρ*_{j0} between the important x_j's and y are bounded away from 0. The lower bound depends on n, however, so the ρ*_{j0}'s are allowed to tend to 0 asymptotically. This condition rules out the situation where predictors are marginally uncorrelated with y but jointly correlated with it. Theorem 3 ensures that the probability that the true model is included in the submodel screened by CC-SIS tends to 1 at an exponential rate.

3. Numerical Examples and Extensions

In this section, we first conduct Monte Carlo simulation studies to illustrate the ranking consistency and the sure screening property of the proposed procedure empirically, and compare its finite sample performance with that of several other screening procedures under different model settings. We further consider a two-stage approach for analyzing ultrahigh dimensional data using varying coefficient models in Section 3.2, and we study an iterative screening procedure to enhance the finite sample performance of CC-SIS in Section 3.3. The kernel function is taken to be K(u) = 0.75(1 − u²)_+ in all the numerical studies.

For each simulation example (i.e., Examples 1 and 3 below), the covariate u and the predictor vector x = (x_1, …, x_p)^T are generated as follows. First draw u* and x from (u*, x) ~ N(0, Σ), where Σ is a (p + 1) × (p + 1) covariance matrix with elements σ_{ij} = ρ^{|i−j|}, i, j = 1, …, p + 1.

We consider ρ = 0.8 and 0.4 for high correlation and low correlation, respectively. Then take u = Φ(u*), where Φ(·) is the cumulative distribution function of the standard normal distribution. Thus, u follows the uniform distribution U(0, 1) and is correlated with x, and all the predictors x_1, …, x_p are correlated with each other. The random error ε is drawn from N(0, 1). The model dimension p is taken to be 1000, and the sample size n is 200, which gives [n^{4/5}/log(n^{4/5})] = 16. In our simulation we consider d = ν[n^{4/5}/log(n^{4/5})] with ν = 1, 2 and 3. All the simulation results are based on 1000 replications.
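As a concrete illustration of this design, the sketch below generates one replication of Example 1, Case I (whose coefficient functions are listed in Section 3.1); setting the intercept β_0(u) = 0 and the helper name are our own choices, since the example does not specify an intercept.

```python
import numpy as np
from scipy.stats import norm

def generate_case1(n=200, p=1000, rho=0.4, seed=0):
    """One replication of Example 1, Case I: (u*, x) ~ N(0, Sigma) with
    sigma_ij = rho^|i-j|, u = Phi(u*), and the five nonzero coefficient functions.
    The intercept beta_0(u) is set to 0 here (not specified in the example)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p + 1)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])        # (p+1) x (p+1), sigma_ij = rho^|i-j|
    Z = rng.multivariate_normal(np.zeros(p + 1), Sigma, size=n)
    u = norm.cdf(Z[:, 0])                                     # u ~ U(0, 1), correlated with x
    X = Z[:, 1:]
    beta = np.zeros((n, p))
    beta[:, 1] = 2.0 * (u > 0.4)                              # beta_2(u)   (0-based column 1)
    beta[:, 99] = 1.0 + u                                     # beta_100(u)
    beta[:, 399] = (2.0 - 3.0 * u) ** 2                       # beta_400(u)
    beta[:, 599] = 2.0 * np.sin(2.0 * np.pi * u)              # beta_600(u)
    beta[:, 999] = np.exp(u / (u + 1.0))                      # beta_1000(u)
    y = np.sum(X * beta, axis=1) + rng.standard_normal(n)     # model (2.1) with epsilon ~ N(0, 1)
    return u, X, y
```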

The following criteria are used to assess the performance of CC-SIS:

  • Rj: The average of the ranks of x_j in terms of the screening criterion over the 1000 replications. For instance, R_j for SIS is the average rank of the Pearson correlation between x_j and y, sorted in decreasing order; R_j for CC-SIS refers to the average rank of ρ̂*_j.

  • M: The minimum size of a submodel that contains all the true predictors. In other words, M is the largest rank among the true predictors, M = max_{j ∈ M*} R_j, where M* is the true model index set. We report the 5%, 25%, 50%, 75% and 95% quantiles of M from the 1000 replications.

  • pa: The proportion of submodels M̂ of size d that contain all the true predictors among the 1000 replications.

  • pj: The proportion of submodels M̂ of size d that contain x_j among the 1000 replications.

These criteria are used to empirically verify the theoretical properties in Theorems 2 and 3. The ranking consistency of a screening procedure refers to the property that the screening scores of the true predictors rank above those of the unimportant ones; hence a reasonable screening procedure is expected to yield small R_j's for the true predictors, and consequently a small minimum submodel size M. The sure screening property claims an overwhelming probability that all true predictors are selected into M̂, so it can be verified by checking whether p_a and the p_j's of the important x_j's are close to one. In addition, M being smaller than d also implies that all important predictors are included in the submodel of size d.
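For concreteness, the following sketch computes these criteria from the screening scores of one method, assuming the scores are stored in a replications-by-predictors array; the helper name and array layout are ours.

```python
import numpy as np

def screening_criteria(scores, true_idx, d):
    """Compute R_j, the quantiles of M, p_a and p_j from screening scores.

    scores   : array of shape (n_rep, p), larger score = more important
    true_idx : indices of the truly active predictors (the set M*)
    d        : submodel size
    """
    # rank 1 = largest score within each replication
    ranks = np.argsort(np.argsort(-scores, axis=1), axis=1) + 1
    R_j = ranks[:, true_idx].mean(axis=0)                  # average rank of each true predictor
    M = ranks[:, true_idx].max(axis=1)                     # minimum submodel size per replication
    M_quantiles = np.quantile(M, [0.05, 0.25, 0.50, 0.75, 0.95])
    in_model = ranks[:, true_idx] <= d                     # true predictor enters the size-d submodel
    p_j = in_model.mean(axis=0)                            # selecting rate of each true predictor
    p_a = in_model.all(axis=1).mean()                      # rate of containing all true predictors
    return R_j, M_quantiles, p_j, p_a
```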

3.1. Monte Carlo simulation

In this section, we conduct Monte Carlo simulations to examine the finite sample performance of CC-SIS, and compare its performance with that of SIS (Fan and Lv, 2008), SIRS (Zhu, et al., 2011) and DC-SIS (Li, Zhong and Zhu, 2012).

Example 1

The true model index set in this example is taken to be M* = {2, 100, 400, 600, 1000}. To make a fair comparison, we consider the following two model settings. In Case I, the nonzero coefficient functions truly vary over u; in Case II, the nonzero coefficient functions are constants, so the true model is in fact a linear model. Specifically, the coefficient functions are given below.

  • Case I
    The nonzero coefficient functions are defined by
    \beta_2(u) = 2\,I(u > 0.4), \quad \beta_{100}(u) = 1 + u, \quad \beta_{400}(u) = (2 - 3u)^2,
    \beta_{600}(u) = 2\sin(2\pi u), \quad \beta_{1000}(u) = \exp\{u/(u+1)\}.
  • Case II
    The nonzero coefficient functions are defined by
    \beta_2(u) = 1, \quad \beta_{100}(u) = 0.8, \quad \beta_{400}(u) = 1.2, \quad \beta_{600}(u) = -0.8, \quad \beta_{1000}(u) = -1.2.

First consider Case I, in which the data were generated from a varying coefficient model. Table 1 reports the R_j's of the active predictors. The ranking consistency of CC-SIS is demonstrated by the fact that the ρ̂*_j's of the active predictors rank at the top for both ρ = 0.4 and ρ = 0.8. In contrast, SIS ranks x_600 far down the list, leaving it indistinguishable from the unimportant x_j's. The reason is that β_600(u) = 2 sin(2πu) has mean 0 when u is regarded as a random variable from U(0, 1). Therefore, when the varying coefficient model is misspecified as a linear regression model and SIS is applied, the constant coefficient β_600 is indeed 0, and hence the true marginal correlation between x_600 and y is 0. Consequently, the magnitude of the Pearson correlation for x_600 is expected to be small, although x_600 is functionally important, as CC-SIS successfully detects. In addition, SIRS and DC-SIS likewise fail to identify x_600 under the varying coefficient model setting.

Table 1.

Rj of the true predictors for Example 1.

Method Low correlation: ρ = 0.4 High correlation: ρ = 0.8
R2 R100 R400 R600 R1000 R2 R100 R400 R600 R1000
Case I: varying coefficient model

SIS 3.5 1.5 6.5 461.3 2.2 7.9 1.8 14.0 468.1 3.1
SIRS 3.7 1.6 10.7 486.8 2.1 9.4 1.8 15.7 454.5 3.2
DC-SIS 3.1 1.6 10.1 350.6 2.2 6.3 1.9 12.9 341.8 3.4
CC-SIS 2.7 2.1 3.8 3.8 3.3 6.8 2.1 6.9 5.9 3.9

Case II: linear model

SIS 3.1 5.4 1.7 5.4 1.7 5.2 12.6 1.9 12.9 2.2
SIRS 3.2 6.3 1.8 5.8 1.8 5.6 14.1 2.0 14.4 2.3
DC-SIS 3.2 6.4 1.8 6.8 1.8 5.6 13.6 2.1 15.0 2.1
CC-SIS 3.1 6.6 1.7 7.0 1.7 9.2 12.3 1.7 12.7 1.9

The proportions p_a and p_j's for the important predictors are tabulated in Table 2. All p_a and p_j's of CC-SIS are close to one, even for the smallest submodel size d = 16, which illustrates the sure screening property. In contrast, the low p_600 and p_a values of the other three screening procedures indicate their failure to detect x_600, and increasing the submodel size d does not help much.

Table 2.

The selecting rates pa and pj ’s for Example 1.

d Method Low correlation: ρ = 0.4 High correlation: ρ = 0.8
p2 p100 p400 p600 p1000 pa p2 p100 p400 p600 p1000 pa
Case I: varying coefficient model

16 SIS 0.99 1.00 0.95 0.03 0.99 0.03 0.93 0.99 0.81 0.00 0.99 0.00
SIRS 0.99 1.00 0.90 0.01 1.00 0.01 0.90 1.00 0.76 0.00 0.99 0.00
DC-SIS 0.99 1.00 0.92 0.02 1.00 0.02 0.96 1.00 0.81 0.00 0.99 0.00
CC-SIS 1.00 1.00 1.00 0.99 0.99 0.99 0.96 1.00 0.96 0.97 0.99 0.89

32 SIS 1.00 1.00 0.98 0.07 1.00 0.07 0.98 1.00 0.95 0.03 1.00 0.02
SIRS 0.99 1.00 0.94 0.03 1.00 0.03 0.97 1.00 0.92 0.03 1.00 0.02
DC-SIS 1.00 1.00 0.95 0.05 1.00 0.05 0.99 1.00 0.94 0.02 1.00 0.02
CC-SIS 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99

48 SIS 1.00 1.00 0.99 0.09 1.00 0.09 0.99 1.00 0.97 0.05 1.00 0.05
SIRS 1.00 1.00 0.96 0.04 1.00 0.03 0.99 1.00 0.96 0.05 1.00 0.04
DC-SIS 1.00 1.00 0.97 0.08 1.00 0.08 1.00 1.00 0.97 0.05 1.00 0.05
CC-SIS 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

Case II: linear model

16 SIS 1.00 0.98 1.00 0.98 1.00 0.96 0.99 0.79 1.00 0.78 1.00 0.61
SIRS 0.99 0.96 1.00 0.96 1.00 0.93 0.96 0.75 1.00 0.74 0.99 0.54
DC-SIS 0.99 0.96 1.00 0.96 0.99 0.92 0.97 0.73 1.00 0.73 0.99 0.52
CC-SIS 1.00 0.96 1.00 0.95 1.00 0.91 0.91 0.82 1.00 0.82 1.00 0.62

32 SIS 1.00 0.99 1.00 0.99 1.00 0.99 1.00 0.97 1.00 0.97 1.00 0.94
SIRS 1.00 0.98 1.00 0.99 1.00 0.97 1.00 0.95 1.00 0.94 1.00 0.89
DC-SIS 1.00 0.98 1.00 0.98 1.00 0.97 1.00 0.96 1.00 0.94 1.00 0.90
CC-SIS 1.00 0.98 1.00 0.98 1.00 0.96 0.99 0.97 1.00 0.96 1.00 0.92

48 SIS 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 1.00 0.98
SIRS 1.00 0.99 1.00 0.99 1.00 0.98 1.00 0.98 1.00 0.97 1.00 0.96
DC-SIS 1.00 0.99 1.00 0.99 1.00 0.98 1.00 0.98 1.00 0.97 1.00 0.96
CC-SIS 1.00 0.99 1.00 0.98 1.00 0.97 1.00 1.00 1.00 0.98 1.00 0.97

Similar conclusions can be drawn from Table 3. SIS, SIRS and DC-SIS need large submodels to include all the true predictors because of the low rank of x_600. Consequently, submodels of size d do not guarantee that all the important predictors are selected, even with the largest d = 48. CC-SIS, on the other hand, requires only fairly small submodels, so all of the important variables are selected under any of the three choices of d. Therefore, both the ranking consistency and the sure screening property are illustrated in this table.

Table 3.

The quantiles of M for Example 1.

Method Low correlation: ρ = 0.4 High correlation: ρ = 0.8
5% 25% 50% 75% 95% 5% 25% 50% 75% 95%
Case I: varying coefficient model

SIS 26.0 185.8 454.5 728.3 951.0 52.0 228.8 442.5 701.0 951.0
SIRS 61.0 223.8 475.0 744.3 943.0 52.0 208.0 440.0 675.0 922.2
DC-SIS 32.0 135.0 299.5 522.5 841.1 51.0 142.0 297.0 500.3 790.1
CC-SIS 5.0 5.0 5.0 5.0 7.0 6.0 8.0 10.0 13.0 20.0

Case II: linear model

SIS 5.0 5.0 5.0 7.0 14.0 8.0 11.0 15.0 20.0 35.1
SIRS 5.0 5.0 6.0 7.0 22.0 8.0 12.0 15.0 22.0 51.0
DC-SIS 5.0 5.0 6.0 8.0 24.0 8.0 12.0 16.0 21.0 41.1
CC-SIS 5.0 5.0 6.0 7.0 29.1 8.0 11.0 15.0 20.0 44.0

In addition, comparing the two choices of ρ, the models with ρ = 0.4 typically perform better than those with ρ = 0.8 for all four screening procedures. This is because, when the predictors are highly correlated (ρ = 0.8), the screening scores of some unimportant variables are inflated by adjacent important ones, so unimportant predictors may be selected due to their strong correlation with the true predictors.

For Case II, the four screening procedures perform similarly well in terms of all the criteria. Thus CC-SIS remains valid for linear models, but at the price of a higher computational cost. Therefore, if the underlying model is known to be linear, one may prefer SIS due to its easier implementation.

Furthermore, we study the performance of CC-SIS when the kernel estimate is over-smoothed with a larger bandwidth h_L = 1.25h and under-smoothed with a smaller bandwidth h_S = 0.75h, where h is the optimal bandwidth chosen by the plug-in method described in the remark in Section 2. The average ranks, selecting rates and minimum model sizes are very similar to those in Tables 1, 2 and 3, and are thus omitted to save space. Therefore, CC-SIS is stable with respect to bandwidth selection provided that the chosen bandwidth lies within an appropriate range.

3.2. Two-stage approach for varying coefficient models and an application

Consider the varying coefficient model (2.1). Although CC-SIS can reduce the ultrahigh dimensionality p to the moderate scale d, a subsequent step is needed to further select the significant variables and recover the final sparse model. In this section, we describe the entire variable selection procedure, referred to as a two-stage approach.

In the screening stage, CC-SIS is conducted to obtain the submodel index set (2.6) with size d = [n^{4/5}/log(n^{4/5})]. With slight abuse of notation, we denote the screened submodel by

y = x^T\beta(u) + \varepsilon. \qquad (3.1)

Here the screened predictor vector is x = (1, x_{s_1}, …, x_{s_d})^T ∈ ℝ^{d+1} with s_i ∈ M̂ in (2.6), and the screened coefficient vector is β(u) = (β_0(u), β_{s_1}(u), …, β_{s_d}(u))^T ∈ ℝ^{d+1}.

In the post-screening variable selection stage, modified penalized regression procedures are applied to further select important variables and estimate the coefficient function β(u) in model (3.1). Following the idea of the KLASSO method (Wang and Xia, 2009), we aim to estimate the n × (d + 1) matrix

B = \{\beta(u_1), \ldots, \beta(u_n)\}^T = (b_1, \ldots, b_{d+1}),

where b_j = (β_j(u_1), …, β_j(u_n))^T ∈ ℝ^{n×1} is the jth column of B. The estimator B̂_λ of B is defined by

\hat{B}_\lambda = \arg\min_{B \in \mathbb{R}^{n\times(d+1)}} \left\{ \sum_{t=1}^{n}\sum_{i=1}^{n} \{y_i - x_i^T\beta(u_t)\}^2 K_h(u_t - u_i) + n\sum_{j=1}^{d+1} p_\lambda(\|b_j\|) \right\}, \qquad (3.2)

where || · || is the Euclidean norm, pλ(·) is the penalty function, and λ is the tuning parameter to be chosen by a data-driven method.

With a chosen λ, a modified iterative algorithm based on the local quadratic approximation (Fan and Li, 2001) is applied to solve the minimization problem (3.2):

  1. Set the initial value \hat{B}_\lambda^{(0)} to be the unpenalized estimator (Fan and Zhang, 2000):
    \hat{B}_\lambda^{(0)} = \arg\min_{B \in \mathbb{R}^{n\times(d+1)}} \left\{ \sum_{t=1}^{n}\sum_{i=1}^{n} \{y_i - x_i^T\beta(u_t)\}^2 K_h(u_t - u_i) \right\},
  2. Denote the mth-step estimator of B by
    \hat{B}_\lambda^{(m)} = \{\hat\beta_\lambda^{(m)}(u_1), \ldots, \hat\beta_\lambda^{(m)}(u_n)\}^T = (\hat{b}_{\lambda,1}^{(m)}, \ldots, \hat{b}_{\lambda,d+1}^{(m)}).
    Then the (m + 1)th-step estimator is \hat{B}_\lambda^{(m+1)} = \{\hat\beta_\lambda^{(m+1)}(u_1), \ldots, \hat\beta_\lambda^{(m+1)}(u_n)\}^T, with
    \hat\beta_\lambda^{(m+1)}(u_t) = \left( \frac{1}{n}\sum_{i=1}^{n} x_i x_i^T K_h(u_t - u_i) + D^{(m)} \right)^{-1} \left( \frac{1}{n}\sum_{i=1}^{n} x_i y_i K_h(u_t - u_i) \right), \qquad (3.3)

    where the matrix D^{(m)} is a (d + 1) × (d + 1) diagonal matrix with jth diagonal component D_{jj}^{(m)} = p_\lambda'(\|\hat{b}_{\lambda,j}^{(m)}\|)\,/\,\{2\|\hat{b}_{\lambda,j}^{(m)}\|\}.

  3. Iterate step 2 for m = 1, 2, … until convergence.

We can adopt various penalty functions to obtain different D^{(m)}'s in (3.3). In this section, we consider the LASSO penalty, the adaptive LASSO penalty and the SCAD penalty. Specifically, the LASSO penalty (Tibshirani, 1996) yields D_{jj}^{(m)} = λ/‖b̂_{λ,j}^{(m)}‖; the adaptive LASSO (Zou, 2006) replaces λ with a coefficient-specific parameter, that is, D_{jj}^{(m)} = λ_j/‖b̂_{λ,j}^{(m)}‖ with λ_j = λ/‖b̂_{λ,j}^{(0)}‖; and the SCAD penalty (Fan and Li, 2001) gives

D_{jj}^{(m)} = \frac{1}{2\|\hat{b}_j^{(m)}\|}\left\{ \lambda\, I\big(\|\hat{b}_j^{(m)}\| \le \lambda\big) + \frac{\big(a\lambda - \|\hat{b}_j^{(m)}\|\big)_+}{a-1}\, I\big(\|\hat{b}_j^{(m)}\| > \lambda\big) \right\}.
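The sketch below carries out one pass of update (3.3) with the SCAD-based D^{(m)}. The formulas follow the displays above, while the choice a = 3.7, the small guard against a zero column norm, and the function names are our own; it is a sketch under those assumptions rather than the implementation used in the paper.

```python
import numpy as np

def scad_weight(b_norm, lam, a=3.7):
    """D_jj = p'_lambda(||b_j||) / (2 ||b_j||) for the SCAD penalty derivative given above."""
    deriv = lam * (b_norm <= lam) + np.maximum(a * lam - b_norm, 0.0) / (a - 1.0) * (b_norm > lam)
    return deriv / (2.0 * np.maximum(b_norm, 1e-8))          # guard against a zero norm

def lqa_step(X1, y, u, B, lam, h, kernel):
    """One pass of update (3.3). B is the current n x (d+1) matrix of coefficient values,
    X1 the n x (d+1) design matrix whose first column is 1 (the intercept)."""
    n, _ = X1.shape
    b_norms = np.linalg.norm(B, axis=0)                      # ||b_j|| over the n evaluation points
    D = np.diag(scad_weight(b_norms, lam))                   # D^(m) from the current estimate
    B_new = np.empty_like(B)
    for t in range(n):
        w = kernel((u - u[t]) / h) / h                       # K_h(u_t - u_i)
        A = (X1 * w[:, None]).T @ X1 / n + D                 # (1/n) sum_i x_i x_i^T K_h + D^(m)
        r = (X1 * w[:, None]).T @ y / n                      # (1/n) sum_i x_i y_i K_h
        B_new[t] = np.linalg.solve(A, r)                     # beta^(m+1)(u_t)
    return B_new
```

In practice one would iterate lqa_step until B stabilizes, using for instance the Epanechnikov kernel of Section 3, and choose λ by a data-driven criterion such as AIC, BIC or GCV, as in the data example below.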

We next illustrate the proposed two-stage approach by an empirical analysis of FHS data.

Example 2

The Framingham Heart Study (FHS) is a cardiovascular study that began in 1948 under the direction of the National Heart, Lung, and Blood Institute (Dawber, et al., 1951; Jaquish, 2007). In our analysis, 349,985 non-rare single-nucleotide polymorphisms (SNPs) are of interest, and data from 977 subjects are available. The goal is to detect the SNPs that are important in explaining body mass index (BMI). For each SNP, both a dominant effect and an additive effect are considered, so the dimensionality is 699,970, much larger than the sample size of 977. In addition, one may argue that the effect of SNPs on BMI might change with age. Therefore, the varying coefficient model (2.1) is appropriate, where y is BMI, x is the SNP vector, and u is age.

To select the significant SNPs, the proposed two-stage approach is applied with three penalties, LASSO, adaptive LASSO (ALASSO) and SCAD, along with three tuning parameter selection criteria: AIC, BIC and GCV. The sizes of the nine selected models are tabulated in Table 4, in which the smaller models are nested within the bigger ones and equal sizes indicate identical models. Thus, there are only five distinct models among the nine selected models. The median squared prediction errors (MSPE) of the nine models are reported in parentheses in Table 4. One can see that the CC-SIS+SCAD two-stage approach yields the sparsest model, with size 34 and the smallest MSPE. Furthermore, pairwise likelihood ratio tests for the nested varying coefficient models (Fan, Zhang and Zhang, 2001) are conducted. The p-values are shown in Table 5 and indicate that the sparsest model, chosen by CC-SIS+SCAD, is sufficient.

Table 4.

The sizes and MSPE of the nine models

CC-SIS+LASSO CC-SIS+ALASSO CC-SIS+SCAD
AIC 43 (0.405)¹ 40 (0.401) 34 (0.380)
BIC 42 (0.395) 38 (0.400) 34 (0.380)
GCV 43 (0.405) 40 (0.401) 34 (0.380)
¹ The numbers in parentheses are the MSPE of each model.

Table 5.

The p-values of the pairwise generalized likelihood ratio tests

H0 \ H1 Unpenalized LASSO-AIC LASSO-BIC ALASSO-AIC ALASSO-BIC
LASSO-AIC 0.9952 · · · ·
LASSO-BIC 0.9999 0.9462 · · ·
ALASSO-AIC 0.9999 0.9998 0.9995 · ·
ALASSO-BIC 0.9999 0.9967 0.9854 0.7481 ·
SCAD 0.9999 0.9991 0.9965 0.9516 0.9268

Figure 1 is the plot of the estimated coefficient functions versus age, which depicts the age-dependent effects of the 34 chosen SNPs.

Figure 1.

The estimated coefficient functions of the 34 chosen SNPs, indexed by SNP name. The A or D in parentheses indicates which effect (A: additive, D: dominant) of the chosen SNP is significant.

3.3. Iterative CC-SIS

Similar to SIS (Fan and Lv, 2008), the proposed CC-SIS has one major weakness: since the screening is based on a marginal utility, CC-SIS is likely to miss important predictors that are marginally unrelated to the response but contribute to the response jointly with other variables. To address this issue, we propose an iterative conditional correlation sure independence screening (ICC-SIS) procedure for varying coefficient models. The ICC-SIS for choosing d predictors comprises the following steps:

  1. Apply CC-SIS and select the d_1 predictors with the largest ρ̂*_j values, denoted by M̂_1 = {x_{11}, …, x_{1d_1}}, where d_1 ≤ d.

  2. Let X_s be the n × d_1 matrix of selected predictors, and X_r the complementary n × (p − d_1) matrix of remaining predictors. For any given u ∈ 𝒰, compute the weighted projection of X_r as X_proj(u) = X_r − X_s(X_s^T W(u) X_s)^{-1} X_s^T W(u) X_r, where W(u) is the n × n diagonal weight matrix with ith diagonal element ω_i(u) = K_h(u_i − u)/{Σ_{i=1}^n K_h(u_i − u)}. Note that the matrix X_proj(u) depends on u.

  3. For each u_i, i = 1, …, n, and j = 1, …, p − d_1, compute the sample conditional correlation ρ̂(x_{j,proj}, y|u_i) by (2.3) and (2.4), using the jth column of X_proj(u_i) and y. The screening criterion for the jth remaining predictor is ρ̂*_{j,proj} = (1/n) Σ_{i=1}^n ρ̂²(x_{j,proj}, y|u_i). Select d_2 predictors M̂_2 = {x_{21}, …, x_{2d_2}} by ranking the ρ̂*_{j,proj}'s, where d_1 + d_2 ≤ d.

  4. Repeat steps 2 and 3 until the kth step, when d_1 + d_2 + … + d_k ≥ d. The selected predictors are M̂_1 ∪ M̂_2 ∪ … ∪ M̂_k. A two-step sketch of this algorithm is given below.
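The following is a minimal two-step sketch of ICC-SIS, restating the kernel and conditional-correlation helpers so that it is self-contained; the function names and the naive looping are our own, and the marginal step simply reproduces CC-SIS from Section 2.

```python
import numpy as np

def epanechnikov(t):
    return 0.75 * np.maximum(1.0 - t ** 2, 0.0)

def rho2_at(x, y, u, u0, h):
    """Squared conditional correlation rho-hat^2(x, y | u0), computed via (2.3)-(2.4)."""
    w = epanechnikov((u - u0) / h)
    w = w / w.sum()
    ex, ey = w @ x, w @ y
    num = (w @ (x * y) - ex * ey) ** 2
    den = max((w @ (x * x) - ex ** 2) * (w @ (y * y) - ey ** 2), 1e-12)
    return num / den

def icc_sis_two_step(X, y, u, d, d1, h):
    """Two-step ICC-SIS: a marginal CC-SIS pass (step 1), then screening of the
    kernel-weighted projections of the remaining predictors (steps 2 and 3)."""
    n, p = X.shape
    marg = np.array([np.mean([rho2_at(X[:, j], y, u, ui, h) for ui in u]) for j in range(p)])
    step1 = np.argsort(marg)[::-1][:d1]                           # step 1: top d1 predictors
    rest = np.setdiff1d(np.arange(p), step1)
    Xs, Xr = X[:, step1], X[:, rest]
    scores = np.zeros(len(rest))
    for i in range(n):                                            # step 2: projection at each u_i
        w = epanechnikov((u - u[i]) / h)
        W = np.diag(w / w.sum())                                  # W(u_i) with weights omega_l(u_i)
        P = Xr - Xs @ np.linalg.solve(Xs.T @ W @ Xs, Xs.T @ W @ Xr)   # X_proj(u_i)
        for j in range(P.shape[1]):                               # step 3: score the projected predictors
            scores[j] += rho2_at(P[:, j], y, u, u[i], h) / n
    step2 = rest[np.argsort(scores)[::-1][: d - d1]]              # add d2 = d - d1 predictors
    return np.concatenate([step1, step2])
```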

In the ICC-SIS algorithm, d_1, …, d_k are chosen by the user according to the desired computational complexity. Two steps are often sufficient in practice to achieve a satisfactory result: the marginally important predictors are selected in the first step, and the jointly important but marginally uncorrelated predictors are identified afterwards. In addition, if d_1 = d, ICC-SIS reduces to CC-SIS. The motivation for ICC-SIS and some theoretical insights are given in the supplementary material. We next examine the performance of ICC-SIS in the following simulation example.

Example 3

This example demonstrates the advantage of ICC-SIS over CC-SIS when some covariates are jointly active in the presence of other covariates but are marginally uncorrelated with the response. We define the true model index set M* = {1, 2, 3, 4, 5}, and the nonzero coefficient functions as follows:

\beta_1(u) = 2 + \cos\{\pi(6u - 5)/3\}, \quad \beta_2(u) = 3 - 3u, \quad \beta_3(u) = -2 + 0.25(2 - 3u)^3,
\beta_4(u) = \sin(9u^2/2) + 1, \quad \beta_5(u) = \exp\{3u^2/(3u^2 + 1)\}.

Moreover, the correlation parameter ρ in the covariance matrix of x is taken to be 0.4. Under this model setting, ρ̂*_3 is approximately 0, but x_3 is still jointly correlated with y by the construction of β_3(u). Tables 6 and 7 compare the performance of CC-SIS and the two-step ICC-SIS. From the tables one can see that ICC-SIS is able to select x_3, which is easily overlooked by CC-SIS. The rankings of the ρ̂*_j's are not reported because, in each iteration of ICC-SIS, the ρ̂*_j's of the remaining predictors change after the previously chosen predictors are removed from the X matrix.

Table 6.

The quantiles of M for Example 3.

5% 25% 50% 75% 95%
CC-SIS 5.0 17.0 68.5 226.0 654.1
ICC-SIS 5.0 9.0 9.0 11.0 17.0
Table 7.

The selecting rates pa and pj ’s for Example 3.

CC-SIS ICC-SIS

d p1 p2 p3 p4 p5 pa p1 p2 p3 p4 p5 pa
16 1.00 1.00 0.24 1.00 1.00 0.24 1.00 1.00 1.00 0.99 1.00 0.99
32 1.00 1.00 0.36 1.00 1.00 0.36 1.00 1.00 1.00 1.00 1.00 1.00
48 1.00 1.00 0.43 1.00 1.00 0.43 1.00 1.00 1.00 1.00 1.00 1.00

4. Summary

In this paper we proposed a feature screening procedure, CC-SIS, specifically for ultrahigh dimensional varying coefficient models. The screening criterion ρ̂*_j is constructed from the conditional correlation, which can be estimated by kernel smoothing. We systematically studied the ranking consistency and sure screening properties of CC-SIS, and conducted several numerical examples to verify them empirically. The Monte Carlo simulations also showed that CC-SIS can be improved by the iterative algorithm ICC-SIS in certain situations. Furthermore, a two-stage approach, based on CC-SIS and modified penalized regression, was developed to estimate sparse varying coefficient models with ultrahigh dimensional covariates.

Supplementary Material

The online supplement contains the technical proofs of Theorems 1-3 and additional details on ICC-SIS.

Acknowledgments

Runze Li's research was supported by National Institute on Drug Abuse (NIDA) grant P50-DA10075, National Cancer Institute (NCI) grant R01 CA168676, and National Natural Science Foundation of China grant 11028103.

Rongling Wu's research was supported by NSF grant IOS-0923975 and NIH grant UL1RR0330184. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NSF, NIH, NIDA or NCI.

Contributor Information

Jingyuan Liu, Email: jingyuan1230@gmail.com, Assistant Professor, Wang Yanan Institute for Studies in Economics, Department of Statistics, and Fujian Key Laboratory of Statistical Science, Xiamen University, China.

Runze Li, Email: rzli@psu.edu, Distinguished Professor, Department of Statistics and The Methodology Center, The Pennsylvania State University, University Park, PA 16802-2111.

Rongling Wu, Email: RWu@phs.psu.edu, Professor, Department of Public Health Sciences, Penn State Hershey College of Medicine, Hershey, PA 17033.

References

  1. Dawber TR, Meadors GF, Moore FE Jr. Epidemiological approaches to heart disease: the Framingham Study. American Journal of Public Health. 1951;41:279–286.
  2. Fan J, Feng Y, Song R. Nonparametric independence screening in sparse ultra-high dimensional additive models. Journal of the American Statistical Association. 2011;106:544–557.
  3. Fan J, Gijbels I. Local Polynomial Modelling and Its Applications. Chapman and Hall; New York, NY: 1996.
  4. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
  5. Fan J, Li R. New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. Journal of the American Statistical Association. 2004;99:710–723.
  6. Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society, Series B. 2008;70:849–911.
  7. Fan J, Song R. Sure independence screening in generalized linear models with NP-dimensionality. The Annals of Statistics. 2010;38:3567–3604.
  8. Fan J, Zhang C, Zhang J. Generalized likelihood ratio statistics and Wilks phenomenon. The Annals of Statistics. 2001;29:153–193.
  9. Fan J, Zhang W. Simultaneous confidence bands and hypotheses testing in varying-coefficient models. Scandinavian Journal of Statistics. 2000;27:1491–1518.
  10. Hall P, Miller H. Using generalized correlation to effect variable selection in very high dimensional problems. Journal of Computational and Graphical Statistics. 2009;18:533–550.
  11. He X, Wang L, Hong HG. Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. The Annals of Statistics. 2013;41:342–369.
  12. Jaquish C. The Framingham Heart Study, on its way to becoming the gold standard for cardiovascular genetic epidemiology. BMC Medical Genetics. 2007;8:63.
  13. Li G, Peng H, Zhang J, Zhu L-X. Robust rank correlation based screening. The Annals of Statistics. 2012;40:1846–1877.
  14. Li R, Liang H. Variable selection in semiparametric regression modeling. The Annals of Statistics. 2008;36:261–286.
  15. Li R, Zhong W, Zhu LP. Feature screening via distance correlation learning. Journal of the American Statistical Association. 2012;107:1129–1139.
  16. Ruppert D, Sheather SJ, Wand MP. An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association. 1995;90:1257–1270.
  17. Tibshirani R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
  18. Wang H, Xia Y. Shrinkage estimation of the varying coefficient model. Journal of the American Statistical Association. 2009;104:747–757.
  19. Wang L, Li H, Huang J. Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements. Journal of the American Statistical Association. 2008;103:1556–1569.
  20. Zhu LP, Li L, Li R, Zhu LX. Model-free feature screening for ultrahigh dimensional data. Journal of the American Statistical Association. 2011;106:1464–1475.
  21. Zou H. The adaptive LASSO and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
