Discussion
We congratulate the authors on their thought-provoking and fascinating work on a fundamental yet challenging topic in variable selection. Driven by the pressing needs of high dimensional data analysis in many fields, the problem of reducing dimensionality without losing relevant information has become increasingly important. Fan and Lv successfully tackled the extremely challenging case where log(p) = O(n^ξ) for some ξ > 0. The proposed Sure Independence Screening (SIS) is a state-of-the-art method for high dimensional variable screening: it is simple, powerful, and possesses optimal properties. This work is a substantial contribution to the area of variable selection and will also have a significant impact in other scientific fields.
Extension to nonparametric models
In linear models, marginal correlation coefficients between the predictors and the response are effective measures of the strength of their linear relationship. However, correlation coefficients generally do not work for ranking nonlinear effects. Consider the additive model

Y = f1(X1) + f2(X2) + … + fp(Xp) + ε,
where each fj takes an arbitrary nonlinear functional form. Motivated by the ranking idea of the SIS, one could first fit a univariate smoother for each predictor and then use some marginal statistic to rank the covariates. Many interesting questions arise in this approach. First, what are good measures that fully characterize the strength of the nonlinear relationship? Possible choices include nonparametric test statistics, p-values, and goodness-of-fit statistics such as R²; but which is best? Second, how do we develop consistency theory for a procedure that screens nonlinear effects? All of these questions are challenging because of the complicated estimation involved in nonparametric modeling. It would be interesting to explore whether and how the SIS can be extended to this context.
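To make the marginal-smoother idea concrete, here is a minimal sketch in Python. The cubic polynomial fit (standing in for a more flexible smoother such as a spline), the R² ranking criterion, and the names marginal_r2 and nonparametric_screen are illustrative choices on our part, not part of the authors' proposal.

```python
import numpy as np


def marginal_r2(x, y, degree=3):
    """Goodness of fit (R^2) of a univariate smoother of y on one covariate x.

    A low-degree polynomial is used purely as a stand-in for a more flexible
    smoother such as a spline or a local linear fit.
    """
    coefs = np.polyfit(x, y, deg=degree)
    fitted = np.polyval(coefs, x)
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def nonparametric_screen(X, y, d):
    """Rank covariates by marginal R^2 and keep the indices of the top d."""
    scores = np.array([marginal_r2(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:d]


# Toy example with p >> n: only X_0 and X_1 carry (nonlinear) signal.
rng = np.random.default_rng(0)
n, p = 100, 1000
X = rng.standard_normal((n, p))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + 0.5 * rng.standard_normal(n)
print(nonparametric_screen(X, y, d=10))
```

In this toy example the second covariate enters purely through a quadratic, so a linear correlation ranking would essentially miss it, while the marginal R² of the smoother picks it up; this is precisely the motivation for a nonparametric extension.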
Connection to multiple hypothesis testing and false discovery rate control
The variable selection problem can be regarded as one of testing multiple hypotheses: H1 : β1 = 0, …, Hp : βp = 0. Screening important variables is hence equivalent to identifying the hypotheses to be rejected. The false discovery rate (Benjamini and Hochberg, 1995) was developed to control the expected proportion of false rejections. Consistent procedures based on individual tests of each parameter have been developed (Pötscher, 1983; Bauer et al., 1988). Recently, Bunea et al. (2006) considered the case where p increases with n and showed that the false discovery rate or Bonferroni adjustment can lead to consistent variable selection under certain conditions. Their method is based on the ordered p-values of individual t-statistics for testing Hj : βj = 0, j = 1, …, p. It would be interesting to compare the SIS with these adjusted multiple testing approaches.
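As a point of reference, the following minimal sketch implements a screening rule of this flavour: marginal (one-covariate-at-a-time) t-tests followed by the Benjamini-Hochberg step-up rule. The use of marginal rather than joint t-statistics and the level q = 0.05 are our illustrative assumptions; Bunea et al. (2006) work with a more careful construction and explicit conditions.

```python
import numpy as np
from scipy import stats


def marginal_pvalues(X, y):
    """Two-sided p-values from simple linear regressions of y on each column of X."""
    n, p = X.shape
    pvals = np.empty(p)
    for j in range(p):
        xc = X[:, j] - X[:, j].mean()
        yc = y - y.mean()
        beta = xc @ yc / (xc @ xc)                       # slope estimate
        resid = yc - beta * xc
        se = np.sqrt(resid @ resid / (n - 2) / (xc @ xc))
        pvals[j] = 2 * stats.t.sf(abs(beta / se), df=n - 2)
    return pvals


def bh_select(pvals, q=0.05):
    """Benjamini-Hochberg step-up rule: reject the hypotheses with the smallest
    p-values up to the largest k such that p_(k) <= k * q / p."""
    p = len(pvals)
    order = np.argsort(pvals)
    below = pvals[order] <= q * np.arange(1, p + 1) / p
    k = np.nonzero(below)[0].max() + 1 if below.any() else 0
    return order[:k]


# Toy example: three truly nonzero coefficients among p = 500.
rng = np.random.default_rng(1)
n, p = 120, 500
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + rng.standard_normal(n)
print(sorted(bh_select(marginal_pvalues(X, y), q=0.05)))
```

Comparing the set selected this way with the top-d set retained by the SIS (or by the nonparametric ranking sketched above) would be one concrete way to carry out the comparison suggested here.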
References
- Bauer P, Pötscher BM, Hackl P. Model selection by multiple test procedures. Statistics. 1988;19:39–44.
- Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B. 1995;57:289–300.
- Bunea F, Wegkamp M, Auguste A. Consistent variable selection in high dimensional regression via multiple testing. Journal of Statistical Planning and Inference. 2006;136:4349–4364.
- Pötscher B. Order estimation in ARMA models by Lagrange multiplier tests. Annals of Statistics. 1983;11:872–885.