Published in final edited form as: J Am Stat Assoc. 2019 Jul 22;115(531):1393–1405. doi: 10.1080/01621459.2019.1632078

MODEL-FREE FORWARD SCREENING VIA CUMULATIVE DIVERGENCE

Tingyou Zhou a, Liping Zhu b, Chen Xu c, Runze Li d

Abstract

Feature screening plays an important role in the analysis of ultrahigh dimensional data. Due to complicated model structure and high noise level, existing screening methods often suffer from model misspecification and the presence of outliers. To address these issues, we introduce a new metric named cumulative divergence (CD), and develop a CD-based forward screening procedure. This forward screening method is model-free and resistant to the presence of outliers in the response. It also incorporates the joint effects among covariates into the screening process. With a data-driven threshold, the new method can automatically determine the number of features that should be retained after screening. These merits make the CD-based screening very appealing in practice. Under certain regularity conditions, we show that the proposed method possesses sure screening property. The performance of our proposal is illustrated through simulations and a real data example.

Keywords: Cumulative divergence, feature screening, forward screening, high dimensionality, sure screening property, variable selection

1. INTRODUCTION

Regression analysis with ultrahigh dimensional covariates arises in many scientific fields such as agriculture, biomedicine, economics, finance, and genetics. It is desirable to identify the important covariates that are truly influential to the response. Traditional best subset selection methods are computationally infeasible in the presence of ultrahigh dimensional covariates. In the past two decades, many regularization methods, such as the LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001), adaptive LASSO (Zou, 2006), and the Dantzig selector (Candes and Tao, 2007), have been proposed for variable selection. However, when the covariates are ultrahigh dimensional, Fan et al. (2009) stated that these regularization methods suffer from the simultaneous challenges of computational expediency, statistical accuracy, and algorithmic stability.

To deal with ultrahigh dimensionality, Fan and Lv (2008) suggested screening out most unimportant covariates before implementing an elaborative variable selection. They proposed a sure independence screening procedure (SIS) for linear models using the marginal Pearson correlation between each covariate and the response. Since the seminal work of Fan and Lv (2008), feature screening has received extensive attention in the past decade. In particular, Wang (2009) proposed a forward regression procedure and Chang et al. (2013) suggested a marginal likelihood ratio test to screen out unimportant covariates in linear models. Li et al. (2012) suggested replacing the Pearson correlation with Kendall's rank correlation in the presence of outliers. Ma, Li and Tsai (2017) proposed quantile partial correlation for feature screening in linear quantile regression. Fan and Song (2010) and Xu and Chen (2014) suggested maximum likelihood based screening, and Mai and Zou (2013) proposed a Kolmogorov-Smirnov statistic to screen out unimportant features in generalized linear models. Fan et al. (2011) and He, Wang and Hong (2013) suggested nonparametric screening procedures for additive models. Song et al. (2014) proposed an independence screening procedure for varying coefficient models. These model-based screening procedures are effective if the working model is close to the underlying true model, and may be very ineffective otherwise.

To minimize the impact of model misspecification, several model-free screening methods have been developed. For instance, Zhu et al. (2011) proposed a sure independent ranking and screening procedure for a general class of index models. Li, Zhong and Zhu (2012) suggested distance correlation for feature screening, which can simultaneously deal with grouped covariates and a multivariate response. Shao and Zhang (2014) introduced martingale difference correlation to perform screening when the conditional mean of the response is of interest. These model-free methods are favored when we lack prior information on the regression structure. However, most of them are based on marginal correlations and are vulnerable in the presence of outliers.

In the present work, we develop a model-free forward screening procedure for ultrahigh dimensional data. Forward screening is related to but much more challenging than conditional screening. For conditional screening, the conditioning set is fixed, whereas for our proposed forward screening procedure, the conditioning set is iteratively updated in a data-driven fashion. Moreover, existing conditional screening procedures are model-based (Wang, 2009; Xu and Chen, 2014; Barut, Fan and Verhasselt, 2016), and there is little literature on model-free conditional screening. To the best of our knowledge, how to design a model-free forward screening procedure has not been studied yet. We aim to fill this gap in this paper. To this end, we first introduce the concept of cumulative divergence (CD), a new correlation metric to characterize functional dependence. We show that the CD is robust to the presence of outliers in the conditioning variable. We further propose a CD-based forward screening procedure. At each step of the forward screening, a new covariate is added to an active index set based on its conditional CD with the response. This procedure stops when the conditional CD of every remaining covariate falls below a certain threshold. Compared with marginal screening methods, the forward screening incorporates the joint correlation among the covariates. With a data-driven threshold, our proposal can adaptively determine the number of features that should be retained after screening. It is therefore convenient to implement without ad hoc tuning steps. Due to its robustness, our proposal performs well even when the underlying true model is misspecified, and it remains robust in the presence of outliers. This appealing property makes the CD-based forward screening attractive for handling ultrahigh dimensional noisy data. Under some regularity conditions, we show that our forward screening method possesses the sure screening property in the terminology of Fan and Lv (2008). We further demonstrate the finite sample performance of the proposed procedures through simulations and a real data example.

We summarize the major contributions of this paper as follows. (1) The proposed forward screening approach is distinguished from marginal screening approaches in that the joint correlations among the covariates are taken into account by the forward screening procedure yet are ignored by the marginal screening methods (Zhu et al., 2011; Li, Zhong and Zhu, 2012). (2) The proposed forward screening procedure is model-free, and hence robust to model misspecification. Thus, the proposed procedure is different from existing model-based forward regression and conditional screening methods (Wang, 2009; Xu and Chen, 2014; Barut, Fan and Verhasselt, 2016). This model-free property is very appealing in ultrahigh dimensional data analysis, where prior information on the underlying regression structure is often lacking. (3) We propose the CD to quantify deviation from mean independence. The CD is robust to the presence of outliers in the conditioning variable, and is thus different from the martingale difference correlation (Shao and Zhang, 2014). Our proposed CD-based forward screening approach inherits this robustness and is robust to the presence of outliers in the response.

This paper is organized as follows. In Section 2, we introduce the notion of cumulative divergence and study its properties. In Section 3, we propose a model-free forward screening procedure and establish its sure screening property. In Section 4, we assess the finite sample performance of our proposed forward screening procedure through comprehensive numerical studies. Some concluding remarks are given in Section 5. All technical details are relegated to the Appendix and a supplementary document.

2. THE CUMULATIVE DIVERGENCE

In each step of the forward screening procedure to be developed, we have to determine whether a covariate should be selected by testing whether the conditional mean function of the response variable is independent of this covariate. This motivates us to start with a simplified problem: testing the mean independence hypothesis

\[
H_0: E(Y \mid X) = E(Y) \text{ almost surely} \quad\text{versus}\quad H_1: \text{otherwise}. \tag{2.1}
\]

Let $(\tilde X, \tilde Y)$ be an independent copy of $(X, Y)$. We assume $\mathrm{var}(X) > 0$ and $0 < \mathrm{var}(Y) < \infty$ throughout. We do not require $\mathrm{var}(X) < \infty$. Let "⇔" stand for "the statements on both the left and the right hand sides are equivalent", and $\mathrm{supp}(X)$ stand for the support of the conditioning variable $X$. We first note that

\[
\begin{aligned}
& E(Y \mid X) = E(Y) \text{ almost surely} \\
\Leftrightarrow\ & E(Y \mid X < x_0) = E(Y), \ \text{for all } x_0 \in \mathrm{supp}(X) \\
\Leftrightarrow\ & \mathrm{cov}\{Y, \mathbf{1}(X < x_0)\} = 0, \ \text{for all } x_0 \in \mathrm{supp}(X) \\
\Leftrightarrow\ & E[\mathrm{cov}^2\{Y, \mathbf{1}(X < \tilde X) \mid \tilde X\}] = 0.
\end{aligned} \tag{2.2}
\]
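The last equivalence in (2.2) can be seen by integrating over the independent copy: since $\tilde X$ is independent of $(X, Y)$,

\[
E[\mathrm{cov}^2\{Y, \mathbf{1}(X < \tilde X) \mid \tilde X\}] = \int_{\mathrm{supp}(X)} \mathrm{cov}^2\{Y, \mathbf{1}(X < x_0)\}\,dF_X(x_0),
\]

which vanishes if and only if $\mathrm{cov}\{Y, \mathbf{1}(X < x_0)\} = 0$ for ($F_X$-almost) all $x_0 \in \mathrm{supp}(X)$.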

This motivates us to define the cumulative covariance (CCov) and the CD as follows.

Definition 2.1. Assume var(X) > 0 and 0 < var(Y) < ∞. The cumulative covariance, denoted CCov(Y | X), and the cumulative divergence, denoted CD(Y | X), between random variables X and Y are defined, respectively, by

\[
\mathrm{CCov}(Y \mid X) \overset{\mathrm{def}}{=} E[\mathrm{cov}^2\{Y, \mathbf{1}(X < \tilde X) \mid \tilde X\}] \quad\text{and} \tag{2.3}
\]
\[
\mathrm{CD}(Y \mid X) \overset{\mathrm{def}}{=} \mathrm{CCov}(Y \mid X)/\mathrm{var}(Y). \tag{2.4}
\]

The definition of the CD allows var(X) = ∞, indicating that the distribution of X can be heavy-tailed. Moreover, since only the rank of X enters the definition of CCov(Y | X), CD(Y | X) is robust to outliers in the conditioning variable X. The following theorem states that the CD possesses several other appealing properties.

Theorem 1. The CD has the following properties.

  1. Assume $\mathrm{var}(X) > 0$ and $0 < \mathrm{var}(Y) < \infty$; then $0 \le \mathrm{CD}(Y \mid X) \le 1/4$, and $\mathrm{CD}(Y \mid X) = 0$ if and only if $E(Y \mid X) = E(Y)$ almost surely. In addition, $\mathrm{CD}(X \mid Y) = \mathrm{CD}(Y \mid X) = 0$ if $F(y \mid X) = F(y)$ for all $y \in \mathbb{R}$, where $F(y \mid X) \overset{\mathrm{def}}{=} \mathrm{pr}(Y < y \mid X)$ and $F(y) = \mathrm{pr}(Y < y)$.

  2. For $a, b \in \mathbb{R}$ with $a \neq 0$, and an arbitrary strictly monotone transformation $M(X)$, $\mathrm{CD}(Y \mid X) = \mathrm{CD}\{aY + b \mid M(X)\}$.

  3. If $X$ and $Y$ are jointly normal with Pearson correlation $\rho$, then $\mathrm{CD}(Y \mid X) = \mathrm{CD}(X \mid Y) = \rho^2/(2\sqrt{3}\pi)$. In particular, $\mathrm{CD}(X \mid X) = 1/(2\sqrt{3}\pi)$.

  4. Let $\tilde X$ be an independent copy of $X$. If $Y$ is normal and all involved moments exist, $\mathrm{CD}(Y \mid X)/\mathrm{var}(Y) = E[E^2\{\partial F(\tilde X \mid Y)/\partial Y \mid \tilde X\}]$.

The first assertion of Theorem 1 indicates that CD(Y | X) is a useful measure to detect whether the conditional mean function of Y depends on X functionally. In particular, CD(Y | X) = 0 if and only if E(Y | X) = E(Y). This ensures that the CD is a useful tool to test (2.1). In general, CD(Y | X) ≠ CD(X | Y) even if var(X) = var(Y). If X and Y are independent, then CD(Y | X) = CD(X | Y) = 0; and if X and Y are jointly normal, CD(Y | X) = CD(X | Y).

The second assertion of Theorem 1 indicates that the CD is invariant with respect to strictly monotone transformations of X. This invariance matches the fact that E(Y | X) = E{Y | M(X)} and is not shared by other popular correlation measures, such as the Pearson correlation, martingale difference correlation (Shao and Zhang, 2014), or distance correlation (Székely, Rizzo and Bakirov, 2007; Székely and Rizzo, 2009). This property implies that the CD is robust against model misspecification and the presence of outliers, because it merely uses the ranks rather than the observed values of X. The virtue of robustness makes the associated forward screening procedure, to be developed in Section 3, potentially attractive for ultrahigh dimensional noisy data.

The third assertion of Theorem 1 implies that, when X and Y are jointly normal with Pearson correlation ρ and unit variance, our proposed CD is closely related to other popular correlation measures through ρ. In particular, Kendall's rank correlation (Huber and Ronchetti, 2009) equals $2\arcsin(\rho)/\pi$, the squared martingale difference correlation equals $\rho^2\{4(1 - \sqrt{3} + \pi/3)\}^{-1/2}$, and the squared distance correlation is $\{\rho\arcsin(\rho) + (1-\rho^2)^{1/2} - \rho\arcsin(\rho/2) - (4-\rho^2)^{1/2} + 1\}/(1 + \pi/3 - \sqrt{3})$.

A sample version of the CD can be conveniently constructed. Specifically, let $\{(X_i, Y_i), i = 1, \ldots, n\}$ be a random sample from the joint distribution of $(X, Y)$. We estimate CCov(Y | X) and CD(Y | X) respectively by

\[
\widehat{\mathrm{CCov}}(Y \mid X) \overset{\mathrm{def}}{=} n^{-3}\sum_{j=1}^{n}\Big[\sum_{i=1}^{n}(Y_i - \bar Y)\{\mathbf{1}(X_i < X_j) - F_n(X_j)\}\Big]^2 \quad\text{and}\quad \widehat{\mathrm{CD}}(Y \mid X) \overset{\mathrm{def}}{=} \widehat{\mathrm{CCov}}(Y \mid X)/\widehat{\mathrm{var}}(Y), \tag{2.5}
\]

where

\[
\bar Y \overset{\mathrm{def}}{=} n^{-1}\sum_{i=1}^{n} Y_i, \qquad F_n(X_j) \overset{\mathrm{def}}{=} n^{-1}\sum_{i=1}^{n}\mathbf{1}(X_i < X_j) \qquad\text{and}\qquad \widehat{\mathrm{var}}(Y) \overset{\mathrm{def}}{=} n^{-1}\sum_{i=1}^{n}(Y_i - \bar Y)^2.
\]
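For concreteness, the following is a minimal numpy sketch of the sample statistics in (2.5); the function name cd_hat and the O(n²) indicator matrix are our own illustrative choices, not code from the paper.

```python
import numpy as np

def cd_hat(x, y):
    """Sample cumulative divergence CD^(Y | X) as defined in (2.5)."""
    n = len(x)
    y_c = y - y.mean()                                 # Y_i - Ybar
    less = (x[:, None] < x[None, :]).astype(float)     # less[i, j] = 1(X_i < X_j)
    Fn = less.mean(axis=0)                             # F_n(X_j) = n^{-1} sum_i 1(X_i < X_j)
    inner = y_c @ (less - Fn[None, :])                 # sum_i (Y_i - Ybar){1(X_i < X_j) - F_n(X_j)}
    ccov_hat = (inner ** 2).sum() / n ** 3             # CCov^(Y | X)
    return ccov_hat / y_c.var()                        # divide by var^(Y) = n^{-1} sum (Y_i - Ybar)^2

# Monte Carlo check against Theorem 1(3): CD(X | X) = 1/(2 sqrt(3) pi) ~ 0.0919
rng = np.random.default_rng(0)
x = rng.standard_normal(2000)
print(cd_hat(x, x), 1 / (2 * np.sqrt(3) * np.pi))
```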

To decide critical values for testing the hypothesis (2.1), we propose a wild bootstrap procedure as follows. Define $\widehat\varepsilon_i = Y_i - \bar Y$ and $Y_i^* = \bar Y + a_i\widehat\varepsilon_i$, where the $a_i$ satisfy $\mathrm{pr}(a_i = 1) = \mathrm{pr}(a_i = -1) = 1/2$. The wild bootstrap sample is $\{(X_i, Y_i^*), i = 1, \ldots, n\}$. We repeat the wild bootstrap procedure $m$ times to obtain $\widehat{\mathrm{CD}}^{(1)}(Y^* \mid X), \ldots, \widehat{\mathrm{CD}}^{(m)}(Y^* \mid X)$. Denote by $\tau$ the $(1-\alpha)$th quantile of $\widehat{\mathrm{CD}}^{(1)}(Y^* \mid X), \ldots, \widehat{\mathrm{CD}}^{(m)}(Y^* \mid X)$. We reject $H_0$ at the significance level $\alpha$ if $\widehat{\mathrm{CD}}(Y \mid X)$ calculated from the original sample $\{(X_i, Y_i), i = 1, \ldots, n\}$ is greater than $\tau$, and accept $H_0$ otherwise.
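A sketch of this wild bootstrap test, reusing cd_hat from the snippet above (the function name and defaults are ours):

```python
def cd_test(x, y, m=500, alpha=0.05, rng=None):
    """Wild bootstrap test of H0: E(Y | X) = E(Y) based on CD^."""
    rng = rng or np.random.default_rng()
    eps = y - y.mean()                                 # residuals eps_i = Y_i - Ybar
    stats = np.empty(m)
    for b in range(m):
        a = rng.choice([-1.0, 1.0], size=len(y))       # pr(a_i = 1) = pr(a_i = -1) = 1/2
        stats[b] = cd_hat(x, y.mean() + a * eps)       # CD^ on bootstrap sample (X_i, Y*_i)
    tau = np.quantile(stats, 1 - alpha)                # (1 - alpha)th bootstrap quantile
    return cd_hat(x, y) > tau                          # True means reject H0 at level alpha
```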

We conduct a simulation study to compare the finite-sample performance of the CD with that of four commonly used correlations: the Pearson correlation, Kendall's rank correlation, the distance correlation, and the martingale difference correlation. We consider two scenarios for generating the conditioning variable X. In the first scenario X is standard normal, and in the second scenario X follows the Cauchy distribution. Let $Y = c\exp(-X^2) + \varepsilon$, where $\varepsilon \sim N(0, 1)$. We set c = 0.0, 0.5, 1.0, 1.5 and 2.0; the null hypothesis $H_0$ in (2.1) holds true when c = 0. We set the sample size n = 100 and summarize the simulation results in Figure 1, where the significance level is α = 0.05.

Figure 1: The power curves of the Pearson correlation test (dashed line, circles), Kendall's rank correlation test (dotted line, plus signs), the martingale difference correlation test (dot-dash line, crosses), the distance correlation test (long-dash line, diamonds), and the cumulative divergence test (solid line, stars). In panel (A), both X and ε are standard normal; in panel (B), X follows the Cauchy distribution and ε is standard normal.

It can be clearly seen from Figure 1 that the sizes of all tests are close to the significance level α = 0.05. Both the Pearson correlation test and the rank correlation test fail to detect the non-monotone mean dependence. The CD test is much more powerful than both the martingale difference correlation test and the distance correlation test when X follows the Cauchy distribution. This simulated example empirically confirms the robustness property of the CD test.

3. A FORWARD SCREENING PROCEDURE

In this section we propose a model-free forward screening procedure based on the CD. This new forward screening procedure inherits the appealing properties of the CD.

To ease subsequent presentation, we introduce the following notation. Let $Y$ be the response and $\mathbf{x} = (X_1, \ldots, X_p)^{\mathrm T}$ be the $p$-dimensional covariate vector. Let $F$ be a working index set and $F^c$ be its complement; both $F$ and $F^c$ are subsets of $\{1, 2, \ldots, p\}$. We define $\mathbf{x}_F \overset{\mathrm{def}}{=} \{X_k, k \in F\}$, the covariate vector indexed by $F$, and $\Sigma_F \overset{\mathrm{def}}{=} \mathrm{var}(\mathbf{x}_F)$. Let $|F|$ stand for the cardinality of $F$. We assume throughout that $E(\mathbf{x}) = \mathbf{0}$ for simplicity.

The goal of feature selection is to identify the smallest index set A such that

\[
Y \perp\!\!\!\perp \mathbf{x} \mid \mathbf{x}_A, \tag{3.1}
\]

where $\perp\!\!\!\perp$ stands for statistical independence. Model (3.1) implies immediately that $F(y \mid \mathbf{x}) = F(y \mid \mathbf{x}_A)$ for all $y \in \mathbb{R}$. Therefore, identifying $\mathbf{x}_A$ satisfying model (3.1) is equivalent to seeking the smallest index set

\[
A \overset{\mathrm{def}}{=} \{k : F(y \mid \mathbf{x}) \text{ depends functionally on } X_k \text{ for some } y \in \mathbb{R},\ k = 1, \ldots, p\}.
\]

Model (3.1) covers a wide variety of existing models. Interested readers can refer to Section 2.1 of Zhu et al. (2011) for details.

We first note that model (3.1) ensures that $Y \perp\!\!\!\perp X_k \mid \mathbf{x}_F$ for all $k \in F^c$ and all $F \supseteq A$. Therefore, given a working index set $F$, assessing whether $X_k$, $k \in F^c$, is truly important for the response variable $Y$ amounts to testing the hypothesis that

\[
H_0: Y \perp\!\!\!\perp X_k \mid \mathbf{x}_F \quad\text{versus}\quad H_1: \text{otherwise}. \tag{3.2}
\]

The law of iterated expectations implies immediately that $E\{X_k - E(X_k \mid \mathbf{x}_F) \mid Y\} = E\{E(X_k \mid \mathbf{x}_F, Y) - E(X_k \mid \mathbf{x}_F) \mid Y\}$. Under $H_0$ in (3.2), $E(X_k \mid \mathbf{x}_F, Y) = E(X_k \mid \mathbf{x}_F)$, and hence $E\{X_k - E(X_k \mid \mathbf{x}_F) \mid Y\} = 0$. Under $H_1$ in (3.2), $X_k$ is dependent upon $Y$ even when $\mathbf{x}_F$ is given. Thus it is reasonable to expect that $E(X_k \mid \mathbf{x}_F, Y) \neq E(X_k \mid \mathbf{x}_F)$ and accordingly $E\{X_k - E(X_k \mid \mathbf{x}_F) \mid Y\} \neq 0$. These observations, together with Theorem 1, motivate us to use $\omega_{k|F} \overset{\mathrm{def}}{=} \mathrm{CD}\{X_k - E(X_k \mid \mathbf{x}_F) \mid Y\}$ to test (3.2).

To ensure that $\omega_{k|F}$ has nontrivial power in testing (3.2), we further assume that

A1. $E\{\partial F(y \mid \mathbf{x})/\partial X_k\} \neq 0$ for some $y \in \mathbb{R}$, for all $k \in A$.

It is remarkable that (3.1) ensures $E\{\partial F(y \mid \mathbf{x})/\partial X_k\} = 0$ for all $y \in \mathbb{R}$ and all $k \in A^c$. This fact, together with Assumption A1, ensures that the important and the unimportant covariates are separable, as stated in Theorem 2.

Theorem 2. Under $H_0$ in (3.2), we have $\omega_{k|F} = 0$. If we further assume that $\mathbf{x}$ is normal and Assumption A1 holds, then $\min_{F:\,F^c \cap A \neq \emptyset}\max_{k \in F^c \cap A}\omega_{k|F} > 0$.

Theorem 2 guarantees that, if all the truly important covariates have been selected into $F$ already, then $\omega_{k|F} = 0$ for any $k \in F^c$. However, if a few important covariates have not yet been found, that is, $F^c \cap A \neq \emptyset$, then $\max_{k \in F^c \cap A}\omega_{k|F} > 0$. This motivates us to reject $H_0$ in (3.2) when the sample version of $\max_{k \in F^c \cap A}\omega_{k|F}$ is sufficiently large.

How to estimate $\omega_{k|F}$ is a nontrivial task because it involves estimating $E(X_k \mid \mathbf{x}_F)$. A fully nonparametric estimate of $E(X_k \mid \mathbf{x}_F)$ is clearly undesirable, especially when $\mathbf{x}_F$ is high dimensional. In the present context, we assume that

A2. $E(X_k \mid \mathbf{x}_F) = g_{k|F}(\mathbf{x}_F, \beta_{k|F})$, where $g_{k|F}$ is known and $\beta_{k|F}$ is unknown.

We allow $E(X_k \mid \mathbf{x}_F)$ to be a general parametric function. When $\mathbf{x}$ follows an elliptically contoured distribution, $E(X_k \mid \mathbf{x}_F)$ is indeed a linear function of $\mathbf{x}_F$ for all $k$ and all $F \subseteq \{1, \ldots, p\}$. Examples of elliptically contoured distributions include the multivariate normal, multivariate t, symmetric multivariate Laplace, and multivariate logistic distributions.
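For example, in the multivariate normal case the linear form in Assumption A2 is explicit: with $E(\mathbf{x}) = \mathbf{0}$,

\[
E(X_k \mid \mathbf{x}_F) = \Sigma_{k,F}\,\Sigma_F^{-1}\,\mathbf{x}_F, \qquad\text{so } g_{k|F}(\mathbf{x}_F, \beta_{k|F}) = \beta_{k|F}^{\mathrm T}\mathbf{x}_F \ \text{with}\ \beta_{k|F} = \Sigma_F^{-1}\Sigma_{F,k},
\]

where $\Sigma_{F,k} \overset{\mathrm{def}}{=} \mathrm{cov}(\mathbf{x}_F, X_k)$. This is the form used in the proof of Theorem 2 in the Appendix.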

Let {(xi, Yi), i = 1, …, n} be a random sample from (x, Y), where each covariate, for notational clarity, is assumed to be marginally standardized to have zero mean and unit variance in advance. To carry out the CD test for (3.2), we estimate ωk|F by

\[
\widehat\omega_{k|F} = n^{-2}\sum_{j=1}^{n}\Big[\sum_{i=1}^{n}\mathbf{1}(Y_i < Y_j)\{X_{ik} - g_{k|F}(\mathbf{x}_{iF}, \widehat\beta_{k|F})\}\Big]^2 \Big/ \sum_{i=1}^{n}\{X_{ik} - g_{k|F}(\mathbf{x}_{iF}, \widehat\beta_{k|F})\}^2,
\]

where $\widehat\beta_{k|F}$ is obtained through nonlinear least squares, that is,

\[
\widehat\beta_{k|F} \overset{\mathrm{def}}{=} \arg\min_{\beta_{k|F}}\sum_{i=1}^{n}\{X_{ik} - g_{k|F}(\mathbf{x}_{iF}, \beta_{k|F})\}^2. \tag{3.3}
\]
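To make the estimator concrete, here is a minimal numpy sketch of $\widehat\omega_{k|F}$ with a linear working regression $g_{k|F}(\mathbf{x}_F, \beta) = \beta^{\mathrm T}\mathbf{x}_F$ (the elliptical case above); an ordinary least squares fit stands in for the general nonlinear fit in (3.3), and all names are our own.

```python
def omega_hat(x_k, X_F, y):
    """omega^_{k|F}: sample CD of the residual X_k - g(x_F, beta^) given Y,
    with a linear working model fitted by least squares, as in (3.3)."""
    n = len(y)
    if X_F.shape[1] > 0:
        beta, *_ = np.linalg.lstsq(X_F, x_k, rcond=None)
        delta = x_k - X_F @ beta                           # residuals X_ik - g(x_iF, beta^)
    else:                                                  # F empty: condition on nothing
        delta = x_k - x_k.mean()
    less = (y[:, None] < y[None, :]).astype(float)         # less[i, j] = 1(Y_i < Y_j)
    inner = delta @ less                                   # sum_i 1(Y_i < Y_j) delta_i, for each j
    return (inner ** 2).sum() / (n ** 2 * (delta ** 2).sum())
```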

We reject $H_0$ in (3.2) when $\widehat\omega_{k|F}$ is sufficiently large. Deciding the critical value for the CD test amounts to studying the asymptotic distribution of $\widehat\omega_{k|F}$. Let $g'_{k|F}(\mathbf{x}_F, \beta_{k|F})$, $g''_{k|F}(\mathbf{x}_F, \beta_{k|F})$ and $g'''_{k|F}(\mathbf{x}_F, \beta_{k|F})$ be the first, second and third derivatives of $g_{k|F}(\mathbf{x}_F, \beta_{k|F})$ with respect to $\beta_{k|F}$, respectively. We denote by $g'_{l_1,k|F}(\mathbf{x}_F, \beta_{k|F})$ the $l_1$-th component of $g'_{k|F}(\mathbf{x}_F, \beta_{k|F})$, by $g''_{l_1 l_2,k|F}(\mathbf{x}_F, \beta_{k|F})$ the $(l_1, l_2)$-th component of $g''_{k|F}(\mathbf{x}_F, \beta_{k|F})$, and by $g'''_{l_1 l_2 l_3,k|F}(\mathbf{x}_F, \beta_{k|F})$ the $(l_1, l_2, l_3)$-th component of $g'''_{k|F}(\mathbf{x}_F, \beta_{k|F})$. Let $\delta_{k|F} \overset{\mathrm{def}}{=} X_k - E(X_k \mid \mathbf{x}_F)$ and let $C$ be a generic constant. We assume the following conditions.

  • (B1)

    There exists $\vartheta > 0$ such that $p = o\{\exp(an^\vartheta)\}$ for any $a > 0$.

  • (B2)

    For any working index set $F \subseteq \{1, 2, \ldots, p\}$ and $k \in F^c$: $E(X_k^4) \le C$, $E(\delta_{k|F}^8) \le C$, and $E\{|g'_{l_1,k|F}(\mathbf{x}_F, \beta_{k|F})|^2\} \le C$; $|g_{k|F}(\mathbf{x}_F, \beta_{k|F})| \le G_{k|F}(\mathbf{x}_F)$ with $E[\{G_{k|F}(\mathbf{x}_F)\}^4] \le C$; $|g'_{l_1,k|F}(\mathbf{x}_F, \beta_{k|F})| \le G_{l_1,k|F}(\mathbf{x}_F)$ with $E[\{G_{l_1,k|F}(\mathbf{x}_F)\}^4] \le C$; $|g''_{l_1 l_2,k|F}(\mathbf{x}_F, \beta_{k|F})| \le G_{l_1 l_2,k|F}(\mathbf{x}_F)$ with $E[\{G_{l_1 l_2,k|F}(\mathbf{x}_F)\}^4] \le C$; and $|g'''_{l_1 l_2 l_3,k|F}(\mathbf{x}_F, \beta_{k|F})| \le G_{l_1 l_2 l_3,k|F}(\mathbf{x}_F)$ with $E[\{G_{l_1 l_2 l_3,k|F}(\mathbf{x}_F)\}^4] \le C$, for all $l_1, l_2, l_3$ and $\beta_{k|F}$.

  • (B3)

    There exists $c_0$ such that $\|\Sigma_F^{-1}\|_\infty < c_0$ for all $p$, where $\|A\|_\infty \overset{\mathrm{def}}{=} \max_l \sum_m |a_{lm}|$ stands for the infinity norm of the matrix $A = (a_{lm})$.

Condition (B1) allows $p$ to diverge at an exponential rate of $n$. Condition (B2) is widely used to study the asymptotic behavior of nonlinear least squares estimation; see, e.g., Jennrich (1969) and White (1981). This condition can be simplified dramatically when $g_{k|F}(\mathbf{x}_F, \beta_{k|F})$ is linear. Theorem 3 requires condition (B3) to hold for $|F| = o(n^{1/5})$. Many precision matrices satisfy condition (B3). In particular, if we write $\Sigma_F^{-1} = (\sigma^{-1,lm})_{|F| \times |F|}$ and let $\sigma^{-1,lm}$ equal 1 if $l = m$ and $r_n$ otherwise, then this condition is satisfied as long as $|r_n| \le (c_0 - 1)/(|F| - 1)$. If $\Sigma_F^{-1}$ is a banded or block-diagonal matrix and each row has $d$ nonzero entries, for example, $\sigma^{-1,lm}$ equals 1 if $l = m$, $r$ if $1 \le |l - m| < d$ and 0 if $|l - m| \ge d$, condition (B3) simply requires $|r| \le (c_0 - 1)/(d - 1)$. Condition (B3) can also be satisfied by many other sparse precision matrices. If $\Sigma_F^{-1}$ is a power-decay matrix, say $\sigma^{-1,lm} = \rho_n^{|l-m|}$ for $|\rho_n| < 1$, condition (B3) is satisfied as long as $(1 - \rho_n^{|F|}) \le c_0(1 - \rho_n)$. Condition (B3) is also implied by $\|\Sigma^{-1}\|_\infty < c_0$. Similar conditions are also assumed in the literature; see, for example, Mai et al. (2012, pages 34-35) and Bickel and Levina (2008, page 2580).

Theorem 3. In addition to Conditions (B1)-(B3), we further assume $|F| = o(n^{1/5})$.

  1. Under $H_0$ in (3.2), we have $\omega_{k|F} = 0$ and $\mathrm{pr}(n\widehat\omega_{k|F} < qK_{k|F}) - \mathrm{pr}(Q_{k|F} < q) \to 0$ for any $q \in \mathbb{R}^+$, where $Q_{k|F} \overset{\mathrm{def}}{=} \sum_{j=1}^{\infty}\lambda_{j,k|F}\chi_j^2(1)$, $K_{k|F}$ is defined in (B.1), the $\chi_j^2(1)$'s are independent $\chi^2(1)$ random variables, the $\lambda_{j,k|F}$'s are nonnegative constants that depend on the joint distribution of $(X_k, \mathbf{x}_F, Y)$, and $E(Q_{k|F}) = 1$.

  2. Under $H_1$ in (3.2) and if $\omega_{k|F} > 0$ for $k \in F^c$, we have $\mathrm{pr}\{n^{1/2}(\widehat\omega_{k|F} - \omega_{k|F}) < t\} - \mathrm{pr}(T_{k|F} < t) \to 0$ for any $t \in \mathbb{R}$, where $T_{k|F}$ is a normal random variable with mean zero and variance $\Delta_{k|F}$, and $\Delta_{k|F}$ is defined in (B.3).

The condition $|F| = o(n^{1/5})$ seems somewhat stringent. By refining Assumption A2, for instance to the single-index form $E(X_k \mid \mathbf{x}_F) = g_{k|F}(\mathbf{x}_F^{\mathrm T}\beta_{k|F})$, this condition can be weakened to $|F| = o(n^{1/3})$. This condition is in line with those of Huber (1973), Fan and Peng (2004) and Tan and Zhu (2018). We impose this condition because $\beta_{k|F}$ is unknown and has to be estimated from data. We do not impose a sparsity assumption on $\beta_{k|F}$, but we do require the convergence rate of $\widehat\beta_{k|F}$ to be fast enough to ensure the weak convergence of $\widehat\omega_{k|F}$. The requirement on $\widehat\beta_{k|F}$ can be met under the condition that $|F| = o(n^{1/5})$.

Theorem 3 shows that $\widehat\omega_{k|F}$ is $n$-consistent under $H_0$ and root-$n$ consistent under $H_1$, indicating that the CD test has nontrivial power in testing (3.2). We adopt the wild bootstrap procedure introduced in Section 2 to determine critical values.

Next we adapt our proposed CD test for (3.2) with a working index set $F$ to a forward screening procedure for ultrahigh dimensional feature selection in model (3.1). The rationale of our proposed forward screening procedure is as follows. If $H_0$ in (3.2) is rejected, we update $F$ with $F \cup \{k\}$, because $X_k$ is possibly influential for $Y$. With the updated $F$, we continue testing (3.2) until $H_0$ is accepted for all $k \in F^c$. It is reasonable to expect $A \subseteq F$ when the forward screening procedure stops. To provide theoretical justification for our proposal, we assume the following condition.

A3. There exist a positive constant C and ϖ ∈ [0, 1/2) such that

\[
\min_{F:\,F^c \cap A \neq \emptyset}\ \max_{k \in F^c \cap A}\ \omega_{k|F} > Cn^{-\varpi}. \tag{3.4}
\]

Assumption A3 requires that the signal strength of the truly important covariates, conditional on the covariates $\mathbf{x}_F$ that have already been selected, be strong enough to be detectable; it is also justified in Theorem 2. This assumption is different from the marginal signal assumptions used in the screening literature, in which the marginal signal strength is quantified by setting the working index set $F$ to be the empty set. See, for example, condition 3 in Fan and Lv (2008), condition E in Fan and Song (2010), condition C in Fan et al. (2011), condition (C1) in Zhu et al. (2011), and condition (C2) in Li, Zhong and Zhu (2012). The existing screening literature generally requires that the marginal signal strength of all truly important covariates be greater than a certain threshold. By contrast, Assumption A3 quantifies the signal strength of the truly important covariates conditional on the selected covariates $\mathbf{x}_F$, which ensures that $\omega_{k|F}$ plays a role similar to that of the regression coefficients in linear models. Similar assumptions are also made in the literature; see, for example, condition (C3) in Wang (2009, page 1513) and condition 1 in Barut, Fan and Verhasselt (2016, page 1270). These assumptions are generally regarded as mild and reasonable.

To establish the sure screening property for the proposed screening procedure, we further assume the following conditions.

  • (B4)

    The cardinality of $A$ satisfies $|A| = O(n^{1/5-\gamma})$ for some $\gamma \in (0, 1/5]$.

  • (B5)

    Let $M$ and $\upsilon$ be two generic positive constants. Assume that $E|X_k|^m \le m!M^{m-2}\upsilon/2$ for all $m \ge 2$ and $k = 1, 2, \ldots, p$. Assume in addition that $E|g_{k|F}(\mathbf{x}_F, \beta_{0,k|F})|^m \le m!M^{m-2}\upsilon/2$ and $E|g'_{l_1,k|F}(\mathbf{x}_F, \beta_{0,k|F})|^m \le m!M^{m-2}\upsilon/2$ for all $m \ge 2$, $F \subseteq \{1, 2, \ldots, p\}$ and $k \in F^c$.

Condition (B4) is a technical condition closely related to the assumption $|F| = o(n^{1/5})$ used in Theorem 3. Condition (B5) is milder than the sub-Gaussian assumption (Buldygin and Kozachenko, 1980, Lemma 1).

Theorem 4. Suppose that Conditions (B1)-(B5) and Assumption A3 are satisfied. If we further assume $|F| = o(n^{1/5})$ and $3/5 - 2\varpi - \vartheta > 0$, and set $\nu < Cn^{-\varpi}/2$ in the forward screening procedure, then $\mathrm{pr}(\min_{F:\,F^c \cap A \neq \emptyset}\max_{k \in F^c \cap A}\widehat\omega_{k|F} > \nu) \to 1$ as $n \to \infty$.

Theorem 4 ensures that the proposed procedure can retain all important covariates with an overwhelming probability if ν is chosen properly. Such a desirable property is referred to as the sure screening property. The CD is a robust correlation metric, and our forward screening procedure is also robust to model misspecification. Such merits are particularly appealing for analyzing ultrahigh dimensional data in the absence of prior knowledge of model structure and data quality. Unlike existing model-free marginal screening methods, the proposed method is a stepwise procedure, which incorporates joint correlation among ultrahigh dimensional features in the forward screening process. It thus provides more reliable results in practice. With a data-driven choice of ν, the procedure adaptively determines the number of features to be retained after selection. This makes the implementation of our proposed forward screening method practically convenient, since our proposal does not require additional ad hoc tuning steps.

We describe the algorithm for our proposed forward screening procedure as follows.

Step 1 Start with an initial index set $F = \emptyset$.

Step 2 For all $k \in F^c$, calculate $\widehat\omega_{k|F}$. Denote $k^* = \arg\max_{k \in F^c}\widehat\omega_{k|F}$. If $\widehat\omega_{k^*|F} > \nu$, update $F$ with $F \cup \{k^*\}$, where the data-driven $\nu$ is determined as follows.

  1. Generate $\tilde X_{ik^*} = g_{k^*|F}(\mathbf{x}_{iF}, \widehat\beta_{k^*|F}) + a_i\widehat\delta_{i,k^*|F}$, $i = 1, 2, \ldots, n$, where $\widehat\delta_{i,k^*|F} = X_{ik^*} - g_{k^*|F}(\mathbf{x}_{iF}, \widehat\beta_{k^*|F})$, and the $a_i$ are independent and identically distributed random weights satisfying $\mathrm{pr}(a_i = 1) = \mathrm{pr}(a_i = -1) = 1/2$. We calculate $\widehat{\tilde\omega}_{k^*|F} \overset{\mathrm{def}}{=} \widehat{\mathrm{CD}}\{\tilde X_{k^*} - E(\tilde X_{k^*} \mid \mathbf{x}_F) \mid Y\}$ using $\{(\tilde X_{ik^*}, \mathbf{x}_{iF}, Y_i), i = 1, 2, \ldots, n\}$.

  2. Repeat the above wild bootstrap procedure $B$ times to obtain $\widehat{\tilde\omega}_{k^*|F}^{(1)}, \widehat{\tilde\omega}_{k^*|F}^{(2)}, \ldots, \widehat{\tilde\omega}_{k^*|F}^{(B)}$. Set $\nu$ to be the $(1-\alpha)$th quantile of $\{\widehat{\tilde\omega}_{k^*|F}^{(1)}, \widehat{\tilde\omega}_{k^*|F}^{(2)}, \ldots, \widehat{\tilde\omega}_{k^*|F}^{(B)}\}$. We update the working index set $F$ with $F \cup \{k^*\}$ if $\widehat\omega_{k^*|F} > \nu$.

Step 3 Repeat Step 2 until no covariate can be added into the working index set F.
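Putting Steps 1-3 together, a compact sketch of the forward loop, reusing omega_hat from the snippet in Section 3 (the function names, defaults, and linear working fit are our own illustrative choices):

```python
def cfs_forward_screen(X, y, B=1000, alpha=0.01, rng=None):
    """CD-based forward screening (C-FS): add the covariate with the largest
    conditional CD while it clears its wild-bootstrap threshold nu."""
    rng = rng or np.random.default_rng()
    n, p = X.shape
    F = []                                               # working index set (Step 1)
    while len(F) < p:
        rest = [k for k in range(p) if k not in F]
        omegas = {k: omega_hat(X[:, k], X[:, F], y) for k in rest}
        k_star = max(omegas, key=omegas.get)             # k* = argmax_{k in F^c} omega^_{k|F}
        # Steps 2.1-2.2: wild bootstrap null distribution for omega^_{k*|F}
        X_F = X[:, F]
        if X_F.shape[1] > 0:
            beta, *_ = np.linalg.lstsq(X_F, X[:, k_star], rcond=None)
            fitted = X_F @ beta                          # g(x_iF, beta^_{k*|F})
        else:
            fitted = np.full(n, X[:, k_star].mean())
        delta = X[:, k_star] - fitted                    # delta^_{i,k*|F}
        boot = np.empty(B)
        for b in range(B):
            a = rng.choice([-1.0, 1.0], size=n)          # Rademacher weights a_i
            boot[b] = omega_hat(fitted + a * delta, X_F, y)
        nu = np.quantile(boot, 1 - alpha)                # data-driven threshold nu
        if omegas[k_star] <= nu:                         # Step 3: stop when nothing clears nu
            break
        F.append(k_star)
    return F
```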

Assumption A3 requires that the minimal signal strength be greater than $Cn^{-\varpi}$, and Theorem 4 requires the cutoff $\nu$ to be smaller than one half of the minimal signal strength. These requirements ensure that our proposal possesses the desirable sure screening property. In practice, however, the magnitude of the minimal signal strength is generally unknown; consequently, how to choose an optimal cutoff $\nu$ is not straightforward. To put our proposed procedure into practice, at each step and for each covariate, we choose $\alpha = 0.01$ and set the cutoff to be the 99th percentile of the null distribution of $\widehat\omega_{k|F}$, approximated by the wild bootstrap, in our algorithm. This works satisfactorily in our numerical studies.

4. NUMERICAL STUDIES

4.1. Simulations

In this section, we conduct Monte Carlo simulations to assess the finite sample performance of the CD-based forward screening procedure. For convenience of presentation, we refer to our proposed forward screening method as C-FS. We compare C-FS with the following five competitors: the forward regression designed for linear models by Wang (2009, FR), the least absolute shrinkage and selection operator proposed by Tibshirani (1996, LASSO), the sure independent ranking and screening procedure proposed by Zhu et al. (2011, SIRS), the distance correlation based sure independence screening procedure proposed by Li, Zhong and Zhu (2012, DC-SIS), and the Pearson correlation based sure independence screening procedure proposed by Fan and Lv (2008, SIS).

To determine the number of features to be retained after screening, we use a BIC-type criterion for FR, as suggested by Wang (2009). The model size (tuning parameter) of the LASSO is chosen by 10-fold cross-validation. For SIRS, DC-SIS and SIS, we follow the convention of retaining the $[n/\log(n)]$ top-ranked covariates in the screened model. It should be noted that our C-FS algorithm automatically determines the screening size with a wild bootstrap procedure.

We adopt the following criteria to evaluate the performance of the above methods.

  1. Pind: With a given size, Pind is the empirical probability that an influential covariate is retained after screening.

  2. Pall: With a given size, Pall is the empirical probability that all the influential covariates are retained after screening.

  3. FPR: Let $\widehat A$ be the index set of the retained covariates and $A$ be the index set of truly influential covariates. The false positive rate (FPR) is defined as $|\widehat A \setminus A|/|A^c|$, where $\widehat A \setminus A$ is the index set of irrelevant covariates that are retained after screening and $|M|$ denotes the cardinality of the set $M$.

  4. TPR: The true positive rate (TPR) is defined as $|\widehat A \cap A|/|A|$, where $\widehat A \cap A$ denotes the set of influential covariates that are correctly retained after screening. (A small sketch computing these two rates follows this list.)
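A minimal sketch of Criteria 3-4 (the function name is ours; A_hat and A are index sets and p is the covariate dimension):

```python
def fpr_tpr(A_hat, A, p):
    """FPR and TPR from Criteria 3-4; A_hat, A are index sets, p the dimension."""
    A_hat, A = set(A_hat), set(A)
    fpr = len(A_hat - A) / (p - len(A))   # irrelevant covariates retained / |A^c|
    tpr = len(A_hat & A) / len(A)         # influential covariates retained / |A|
    return fpr, tpr
```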

We report both the mean and the standard errors of the FPR and TPR values based on 500 repetitions. We set the sample size n = 200, the covariate dimension p = 3,000 and the number of bootstrap replications B = 1,000.

Example 1. We generate data from the linear model $Y = \beta^{\mathrm T}\mathbf{x} + c_0\varepsilon$, where $\beta = (5, 5, 5, -15\rho^{1/2}, 0, \ldots, 0)^{\mathrm T}$, $c_0 = 1$ if $\varepsilon \sim N(0, 1)$ and $c_0 = 0.1$ if $\varepsilon \sim t(1)$. We consider the following two scenarios for generating the covariate vector $\mathbf{x} = (X_1, \ldots, X_p)^{\mathrm T}$.

  1. The elliptical case: The covariate $\mathbf{x}$ is drawn from a multivariate normal population with mean zero and covariance matrix $\Sigma = (\sigma_{ij})_{p \times p}$, where $\sigma_{ii} = 1$ for $i = 1, \ldots, p$, $\sigma_{i4} = \sigma_{4i} = \rho^{1/2}$ for $i \neq 4$, and $\sigma_{ij} = \rho$ for $i \neq j$, $i \neq 4$ and $j \neq 4$.

  2. The non-elliptical case: Set $\mathbf{x} = \Sigma^{1/2}\{\widehat{\mathrm{var}}(\mathbf{z})\}^{-1/2}\{\mathbf{z} - \widehat E(\mathbf{z})\}$, where $\Sigma$ is defined in the first scenario, $\mathbf{z} \overset{\mathrm{def}}{=} (Z_1, \ldots, Z_p)^{\mathrm T}$, and the $Z_k$'s are independent of each other and follow the $\chi^2(2)$ distribution (a generation sketch follows this list).
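As referenced in the second scenario, a minimal sketch of the generator; we read $\{\widehat{\mathrm{var}}(\mathbf{z})\}^{-1/2}\{\mathbf{z} - \widehat E(\mathbf{z})\}$ as componentwise standardization and take the Cholesky factor as $\Sigma^{1/2}$ (both are our assumptions):

```python
def gen_nonelliptical_x(n, p, Sigma, rng):
    """Example 1, non-elliptical case: correlated covariates built from
    independent chi^2(2) components standardized to mean 0 and variance 1."""
    z = rng.chisquare(df=2, size=(n, p))
    z_std = (z - z.mean(axis=0)) / z.std(axis=0)   # {var^(z)}^{-1/2}{z - E^(z)}, componentwise
    L = np.linalg.cholesky(Sigma)                  # one choice of square root Sigma^{1/2}
    return z_std @ L.T                             # rows have covariance approximately Sigma
```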

In the above two scenarios, we set ρ to 0.1, 0.5 and 0.9, respectively, to represent small, moderate and high correlation. This example was also used by Fan and Lv (2008) and Zhu et al. (2011). The simulation results are summarized in Tables 1-2.

Table 1:

The mean and the standard errors of both the FPR and the TPR values based on 500 repetitions for Example 1.

ε method | ρ = 0.1: FPR (mean, std), TPR (mean, std) | ρ = 0.5: FPR (mean, std), TPR (mean, std) | ρ = 0.9: FPR (mean, std), TPR (mean, std)

When x follows elliptical distribution

N(0,1) C-FS 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00
FR 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00
LASSO 0.01 0.00 1.00 0.00 0.06 0.00 0.75 0.00 0.04 0.00 0.75 0.00
SIRS 0.00 0.00 0.75 0.00 0.01 0.00 0.75 0.01 0.01 0.00 0.59 0.29
DC-SIS 0.00 0.00 0.75 0.00 0.01 0.00 0.75 0.02 0.01 0.00 0.58 0.29
SIS 0.00 0.00 0.75 0.00 0.00 0.00 0.75 0.00 0.01 0.00 0.60 0.28

t(1) C-FS 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00
FR 0.00 0.00 0.90 0.29 0.00 0.00 0.90 0.29 0.00 0.00 0.77 0.42
LASSO 0.00 0.00 0.84 0.35 0.03 0.02 0.63 0.27 0.02 0.02 0.52 0.34
SIRS 0.01 0.00 0.75 0.01 0.01 0.00 0.75 0.02 0.01 0.00 0.62 0.27
DC-SIS 0.01 0.00 0.75 0.04 0.01 0.00 0.74 0.06 0.01 0.00 0.59 0.28
SIS 0.01 0.00 0.72 0.14 0.01 0.00 0.71 0.16 0.01 0.00 0.51 0.33

When x follows non-elliptical distribution

N(0,1) C-FS 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00
FR 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00
LASSO 0.01 0.01 1.00 0.00 0.06 0.00 0.75 0.00 0.04 0.00 0.75 0.00
SIRS 0.01 0.00 0.75 0.02 0.01 0.00 0.74 0.05 0.01 0.00 0.59 0.28
DC-SIS 0.01 0.00 0.75 0.02 0.01 0.00 0.74 0.05 0.01 0.00 0.59 0.27
SIS 0.01 0.00 0.75 0.02 0.01 0.00 0.74 0.04 0.01 0.00 0.62 0.25

t(1) C-FS 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00 0.00
FR 0.00 0.00 0.94 0.21 0.00 0.00 0.93 0.21 0.00 0.00 0.85 0.30
LASSO 0.00 0.00 0.86 0.32 0.03 0.02 0.62 0.28 0.02 0.02 0.49 0.35
SIRS 0.00 0.00 0.75 0.00 0.01 0.00 0.74 0.05 0.01 0.00 0.57 0.29
DC-SIS 0.01 0.00 0.75 0.04 0.01 0.00 0.74 0.06 0.01 0.00 0.56 0.30
SIS 0.01 0.00 0.71 0.15 0.01 0.00 0.69 0.18 0.01 0.00 0.50 0.32

Table 2:

The empirical probabilities Pind and Pall based on 500 repetitions for Example 1.

ε method | ρ = 0.1: Pind, Pall | ρ = 0.5: Pind, Pall | ρ = 0.9: Pind, Pall

X1 X2 X3 X4 ALL | X1 X2 X3 X4 ALL | X1 X2 X3 X4 ALL

When x follows elliptical distribution

N(0,1) C-FS 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
FR 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
LASSO 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00
SIRS 1.00 1.00 1.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.79 0.79 0.79 0.00 0.00
DC-SIS 1.00 1.00 1.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.76 0.77 0.78 0.00 0.00
SIS 1.00 1.00 1.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.81 0.80 0.81 0.00 0.00

t(1) C-FS 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
FR 0.90 0.90 0.90 0.89 0.89 0.90 0.90 0.91 0.90 0.89 0.77 0.77 0.77 0.77 0.76
LASSO 0.85 0.85 0.86 0.79 0.79 0.84 0.83 0.83 0.00 0.00 0.70 0.77 0.77 0.00 0.00
SIRS 1.00 1.00 1.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.83 0.82 0.83 0.00 0.00
DC-SIS 0.99 1.00 0.99 0.00 0.00 0.99 0.99 0.99 0.00 0.00 0.80 0.78 0.78 0.00 0.00
SIS 0.95 0.96 0.96 0.00 0.00 0.95 0.95 0.94 0.00 0.00 0.69 0.68 0.68 0.00 0.00

When x follows non-elliptical distribution

N(0,1) C-FS 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
FR 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
LASSO 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
SIRS 1.00 1.00 1.00 0.01 0.01 0.99 0.99 0.99 0.00 0.00 0.78 0.80 0.80 0.00 0.00
DC-SIS 1.00 1.00 1.00 0.01 0.01 0.99 0.99 0.99 0.00 0.00 0.77 0.80 0.80 0.00 0.00
SIS 1.00 1.00 1.00 0.01 0.99 1.00 0.99 0.00 0.00 0.00 0.83 0.83 0.84 0.00 0.00

t(1) C-FS 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
FR 0.94 0.94 0.95 0.93 0.91 0.93 0.93 0.93 0.95 0.90 0.83 0.82 0.83 0.91 0.78
LASSO 0.88 0.87 0.88 0.80 0.80 0.83 0.83 0.82 0.00 0.00 0.65 0.66 0.65 0.00 0.00
SIRS 1.00 1.00 1.00 0.00 0.00 0.99 0.99 0.99 0.00 0.00 0.76 0.76 0.76 0.00 0.00
DC-SIS 1.00 1.00 1.00 0.00 0.00 0.99 0.99 0.99 0.00 0.00 0.75 0.74 0.74 0.00 0.00
SIS 0.94 0.94 0.94 0.96 0.00 0.92 0.93 0.92 0.00 0.00 0.65 0.66 0.69 0.00 0.00

In this example, $X_4$ is marginally independent of $Y$. It is thus not surprising to observe from Table 2 that SIRS, DC-SIS and SIS fail to retain $X_4$, as they consider only the marginal effects. The performance of the LASSO is decent for ρ = 0.1 and ε ~ N(0, 1), but it deteriorates sharply as ρ increases. Both FR and C-FS perform well when ε is normal. However, when ε ~ t(1) and $\mathbf{x}$ is elliptical, FR has an average TPR as low as 0.77 for ρ = 0.9; in this scenario, FR is also quite unstable, as reflected by its large standard deviations. By contrast, the proposed C-FS attains stable and satisfactory performance in all scenarios. The simulation results when $\mathbf{x}$ follows the non-elliptical distribution are quite similar to those under the elliptical distribution.

Example 2. We consider three models where Y depends on xA nonlinearly.

  1. $Y = X_1 + 0.8X_2 + 0.6X_3 + 0.4X_4 + 0.2\exp(X_{20} + c_0\varepsilon)$.

  2. $Y = X_1 + 0.8X_2 + 0.6(X_5 + 1)^2 + 0.4X_{10}^3 + 0.2\exp(|X_{20} + 1| + c_0\varepsilon)$.

  3. $Y = \beta_1(X_1)X_2 + \beta_2(X_1)X_3 + \beta_3(X_1)X_4 + \beta_4(X_1)X_5 + c_0\varepsilon$.

In all three models, ε and $c_0$ are generated in the same way as in Example 1. In Examples 2(a) and 2(b), we consider two scenarios for generating $\mathbf{x}$.

  1. The elliptical case: The covariate $\mathbf{x}$ is drawn from a multivariate normal population with mean zero and covariance matrix $\Sigma = (0.5^{|i-j|})_{p \times p}$.

  2. The non-elliptical case: Set $\mathbf{x} = \Sigma^{1/2}\{\widehat{\mathrm{var}}(\mathbf{z})\}^{-1/2}\{\mathbf{z} - \widehat E(\mathbf{z})\}$, where $\Sigma = (0.5^{|i-j|})_{p \times p}$, $\mathbf{z} \overset{\mathrm{def}}{=} (Z_1, \ldots, Z_p)^{\mathrm T}$, and the $Z_k$'s are independent and follow the $\chi^2(2)$ distribution.

In Example 2(c), we generate $U_1$ and $U_2$ independently from the uniform distribution on [0, 1], and set $X_1 = (U_1 + U_2)/2$, $\beta_1(X_1) = \exp\{4(1 - X_1^2)\}$, $\beta_2(X_1) = 3\{1 + \sin(2\pi X_1)\}$, $\beta_3(X_1) = 2\{1 + (1 - X_1)^{3/2}\}$, and $\beta_4(X_1) = \exp(|X_1|)$. Define $X_k = (Z_k + 3U_1)/4$, $k = 2, 3, \ldots, p$, where the $Z_k$'s are independently drawn from (1) the standard normal distribution in the elliptical case and (2) the $\chi^2(2)$ distribution in the non-elliptical case.

The simulation results for Example 2 are reported in Tables 3-4. In Example 2(a), none of SIRS, DC-SIS and SIS is able to identify $X_{20}$ as an important covariate, because these methods are relatively sensitive to the transformation of variables. When ε ~ t(1), the new C-FS is the only method that has satisfactory performance, which confirms the robustness of C-FS. In Example 2(b), the model-based methods, such as FR, LASSO and SIS, fail to retain all the important covariates because the linear model assumption is violated. In comparison, the model-free methods (C-FS, SIRS, DC-SIS) perform relatively better; C-FS performs best, due to its robustness against outliers in the response. In Example 2(c), the marginal screening methods, such as SIRS, DC-SIS and SIS, fail to detect $X_1$ in all scenarios. Both C-FS and LASSO outperform FR when ε is normal, and our C-FS is the only method that maintains satisfactory performance when ε ~ t(1).

Table 3:

The mean and the standard errors of both the FPR and the TPR values based on 500 repetitions for Example 2.

method | x elliptical: ε ~ N(0,1), then ε ~ t(1) | x non-elliptical: ε ~ N(0,1), then ε ~ t(1) (within each block: FPR mean, FPR std, TPR mean, TPR std)
(a) C-FS 0.01 0.00 0.98 0.06 0.01 0.00 0.98 0.07 0.01 0.00 0.97 0.07 0.01 0.00 0.98 0.07
FR 0.00 0.00 0.88 0.18 0.00 0.00 0.25 0.40 0.00 0.00 0.63 0.28 0.00 0.00 0.21 0.34
LASSO 0.00 0.00 0.83 0.28 0.00 0.00 0.22 0.38 0.00 0.00 0.45 0.44 0.00 0.00 0.14 0.32
SIRS 0.01 0.00 0.90 0.10 0.01 0.00 0.86 0.09 0.01 0.00 0.98 0.07 0.01 0.00 0.97 0.08
DC-SIS 0.01 0.00 0.91 0.10 0.01 0.00 0.44 0.42 0.01 0.00 0.98 0.08 0.01 0.00 0.48 0.46
SIS 0.01 0.00 0.94 0.10 0.01 0.00 0.32 0.41 0.01 0.00 0.84 0.27 0.01 0.00 0.31 0.42

(b) C-FS 0.00 0.00 0.98 0.06 0.00 0.00 1.00 0.03 0.00 0.00 0.99 0.04 0.00 0.00 1.00 0.02
FR 0.00 0.00 0.60 0.28 0.00 0.00 0.19 0.35 0.00 0.00 0.44 0.23 0.00 0.00 0.14 0.24
LASSO 0.00 0.00 0.54 0.42 0.00 0.00 0.17 0.35 0.00 0.00 0.15 0.29 0.00 0.00 0.04 0.17
SIRS 0.01 0.00 0.96 0.08 0.01 0.00 0.96 0.08 0.01 0.00 0.99 0.04 0.01 0.00 1.00 0.03
DC-SIS 0.01 0.00 0.98 0.06 0.01 0.00 0.41 0.46 0.01 0.00 0.96 0.14 0.01 0.00 0.39 0.46
SIS 0.01 0.00 0.93 0.15 0.01 0.00 0.29 0.41 0.01 0.00 0.60 0.30 0.01 0.00 0.22 0.33

(c) C-FS 0.01 0.00 0.95 0.09 0.01 0.00 0.99 0.05 0.01 0.00 0.94 0.11 0.01 0.00 0.98 0.06
FR 0.00 0.00 0.89 0.12 0.00 0.00 0.48 0.40 0.00 0.00 0.88 0.14 0.00 0.00 0.49 0.39
LASSO 0.00 0.00 0.96 0.09 0.00 0.00 0.43 0.42 0.00 0.00 0.93 0.12 0.00 0.00 0.43 0.42
SIRS 0.01 0.00 0.73 0.10 0.01 0.00 0.76 0.09 0.01 0.00 0.70 0.13 0.01 0.00 0.73 0.10
DC-SIS 0.01 0.00 0.72 0.11 0.01 0.00 0.74 0.12 0.01 0.00 0.69 0.12 0.01 0.00 0.70 0.13
SIS 0.01 0.00 0.73 0.10 0.01 0.00 0.54 0.29 0.01 0.00 0.72 0.11 0.01 0.00 0.53 0.28

Table 4:

The empirical probabilities Pind and Pall based on 500 repetitions for Example 2.

method | ε ~ N(0,1): Pind, Pall | ε ~ t(1): Pind, Pall

When x follows elliptical distribution

X1 X2 X3 X4 X20 ALL | X1 X2 X3 X4 X20 ALL

(a) C-FS 1.00 1.00 0.99 0.93 0.97 0.90 1.00 1.00 1.00 0.97 0.91 0.90
FR 0.96 0.97 0.91 0.62 0.96 0.58 0.28 0.29 0.27 0.20 0.20 0.19
LASSO 0.92 0.94 0.90 0.70 0.71 0.62 0.25 0.25 0.24 0.19 0.15 0.15
SIRS 1.00 1.00 1.00 1.00 0.48 0.48 1.00 1.00 1.00 1.00 0.32 0.32
DC-SIS 1.00 1.00 1.00 1.00 0.56 0.56 0.53 0.53 0.52 0.48 0.14 0.14
SIS 1.00 1.00 1.00 0.99 0.71 0.71 0.37 0.37 0.37 0.33 0.14 0.14

(b) C-FS 1.00 0.96 1.00 1.00 0.96 0.91 1.00 0.99 1.00 1.00 0.99 0.98
FR 0.59 0.38 0.67 0.64 0.73 0.10 0.21 0.15 0.22 0.22 0.16 0.11
LASSO 0.59 0.55 0.57 0.53 0.48 0.31 0.18 0.18 0.18 0.17 0.12 0.11
SIRS 1.00 1.00 1.00 0.98 0.84 0.83 1.00 1.00 1.00 0.99 0.82 0.82
DC-SIS 1.00 1.00 1.00 0.97 0.95 0.97 0.44 0.43 0.42 0.39 0.34 0.32
SIS 0.96 0.95 0.93 0.88 0.95 0.76 0.31 0.31 0.29 0.29 0.23 0.19

X1 X2 X3 X4 X5 ALL | X1 X2 X3 X4 X5 ALL

(c) C-FS 1.00 1.00 1.00 0.96 0.78 0.75 1.00 1.00 1.00 1.00 0.94 0.94
FR 0.99 1.00 1.00 0.93 0.56 0.51 0.45 0.59 0.64 0.45 0.28 0.24
LASSO 0.89 1.00 1.00 0.99 0.90 0.80 0.24 0.53 0.54 0.46 0.36 0.22
SIRS 0.00 0.99 0.99 0.93 0.73 0.00 0.00 1.00 1.00 0.97 0.81 0.00
DC-SIS 0.00 0.99 1.00 0.92 0.69 0.00 0.00 0.99 0.99 0.94 0.76 0.00
SIS 0.00 1.00 1.00 0.94 0.72 0.00 0.00 0.78 0.79 0.64 0.49 0.00

When x follows non-elliptical distribution

X1 X2 X3 X4 X20 ALL | X1 X2 X3 X4 X20 ALL

(a) C-FS 1.00 1.00 0.98 0.89 1.00 0.86 1.00 1.00 0.98 0.93 0.99 0.89
FR 0.67 0.69 0.54 0.27 0.97 0.21 0.23 0.26 0.17 0.09 0.32 0.08
LASSO 0.52 0.54 0.46 0.28 0.48 0.26 0.16 0.17 0.14 0.08 0.14 0.08
SIRS 1.00 1.00 1.00 1.00 0.88 0.88 1.00 1.00 1.00 1.00 0.83 0.83
DC-SIS 0.99 1.00 0.99 0.98 0.93 0.91 0.52 0.53 0.51 0.46 0.40 0.37
SIS 0.84 0.87 0.81 0.71 0.98 0.67 0.32 0.32 0.31 0.25 0.35 0.22

(b) C-FS 0.99 0.96 1.00 1.00 1.00 0.95 1.00 0.99 1.00 1.00 1.00 0.99
FR 0.16 0.08 0.50 0.53 0.90 0.00 0.06 0.03 0.16 0.20 0.27 0.00
LASSO 0.09 0.08 0.21 0.18 0.22 0.04 0.03 0.02 0.06 0.05 0.06 0.01
SIRS 1.00 1.00 1.00 0.98 0.99 0.97 1.00 1.00 1.00 0.99 1.00 0.98
DC-SIS 0.96 0.96 0.98 0.88 0.99 0.86 0.39 0.39 0.42 0.36 0.41 0.33
SIS 0.37 0.37 0.63 0.66 0.96 0.24 0.15 0.14 0.24 0.26 0.32 0.08

X1 X2 X3 X4 X5 ALL | X1 X2 X3 X4 X5 ALL

(c) C-FS 1.00 0.99 0.97 0.95 0.78 0.73 1.00 1.00 1.00 0.98 0.91 0.89
FR 0.96 1.00 0.98 0.92 0.55 0.49 0.48 0.62 0.63 0.45 0.29 0.24
LASSO 0.82 1.00 0.99 0.99 0.87 0.73 0.26 0.53 0.54 0.45 0.35 0.22
SIRS 0.00 0.99 0.95 0.89 0.66 0.00 0.00 1.00 0.99 0.95 0.70 0.00
DC-SIS 0.00 0.99 0.99 0.88 0.59 0.00 0.00 0.99 0.99 0.92 0.60 0.00
SIS 0.00 1.00 0.99 0.92 0.69 0.00 0.00 0.78 0.77 0.66 0.66 0.00

4.2. An Application

We further illustrate the performance of the proposed C-FS method with a rat eye expression dataset, previously studied by Scheetz et al. (2006) and Huang et al. (2008). This dataset consists of 31,042 probe sets from 120 twelve-week-old male rats, of which only 18,976 probes were sufficiently expressed. The response variable TRIM32 is among these 18,976 probes; the corresponding gene was found to cause Bardet-Biedl syndrome (Chiang et al., 2006). We rank the remaining 18,975 probes according to their variances and retain only the 3,000 probes with the largest variances. Our analysis is based on these 3,000 probes, in addition to the probe TRIM32. The goal is to identify the probes that considerably affect the expression level of TRIM32.

The sample size n = 120 is small compared with the covariate dimension p = 3,000. We apply the aforementioned six feature selection/screening methods to this dataset and denote the retained covariates by $\mathbf{x}_{\widehat A}$. We order the entries of $\widehat A$ according to the relative importance of each retained covariate. Specifically, for SIS, DC-SIS, and SIRS, $\widehat A$ is the index set of the covariates with the $s$ largest marginal effects; for C-FS, FR, and LASSO, $\widehat A$ is the index set of the first $s$ covariates that enter the active set.

We assess the performance of these methods as follows. Given a model size s, we fit an additive model

\[
Y = \sum_{j=1}^{s} f_{k_j}(X_{k_j}) + \varepsilon_k, \tag{4.1}
\]

where k = 1, …, 6 indexes C-FS, FR, LASSO, SIRS, DC-SIS, and SIS, respectively. The subscript $k_j$ denotes the $j$th element in $\widehat A_k$, and $s$ is set from 1 to 10. We stop our comparison at the tenth step because the C-FS algorithm, with critical values decided by the bootstrap, stops at this step: all remaining null hypotheses are accepted at the significance level 0.01, and there is no need to add additional covariates.

We estimate the unknown functions $f_{k_j}$ with the R package mgcv; the adjusted R² and the explained deviance are summarized in Table 5. Since the explained deviance is defined as the proportion of the null deviance explained by the fitted model, a method with larger explained deviance performs better. From Table 5, we observe that, as $s$ increases, the proposed C-FS tends to outperform all of the marginal-effect-based methods. This is partly because those procedures may fail to identify some truly important covariates. In comparison, the performances of C-FS, FR and LASSO are relatively satisfactory, as they take the joint effects into account. Among all methods, the proposed C-FS has the highest adjusted R² and explained deviance.

Table 5:

The adjusted R2 and the explained deviance of the six methods.

model size | adjusted R² (C-FS, FR, LASSO, SIRS, DC-SIS, SIS) | dev. explained (C-FS, FR, LASSO, SIRS, DC-SIS, SIS)
1 0.33 0.62 0.62 0.33 0.33 0.62 0.35 0.62 0.62 0.35 0.35 0.62
2 0.63 0.68 0.68 0.68 0.68 0.68 0.66 0.69 0.69 0.71 0.69 0.69
3 0.66 0.70 0.68 0.69 0.69 0.69 0.71 0.71 0.70 0.71 0.71 0.71
4 0.72 0.72 0.70 0.70 0.70 0.70 0.74 0.73 0.72 0.73 0.71 0.72
5 0.72 0.74 0.70 0.74 0.70 0.70 0.75 0.75 0.73 0.77 0.72 0.72
6 0.79 0.75 0.73 0.76 0.71 0.73 0.83 0.77 0.77 0.80 0.73 0.77
7 0.81 0.76 0.74 0.76 0.72 0.74 0.84 0.78 0.77 0.80 0.74 0.77
8 0.81 0.77 0.77 0.75 0.72 0.73 0.84 0.79 0.80 0.79 0.74 0.77
9 0.82 0.77 0.77 0.75 0.71 0.74 0.86 0.79 0.81 0.80 0.74 0.78
10 0.84 0.78 0.77 0.74 0.73 0.74 0.88 0.80 0.81 0.78 0.76 0.78

We use five-fold cross-validation to further compare prediction performance. Specifically, we randomly partition the dataset into five equal-sized subsamples, denoted $D_1, \ldots, D_5$. For each subsample $D_k$, we use the remaining four subsamples to fit model (4.1) with $s = 10$, and then calculate the mean squared prediction error on $D_k$. We repeat this procedure so that the prediction is performed on each subsample exactly once. The mean squared prediction errors of C-FS, FR, LASSO, SIRS, DC-SIS and SIS are 0.43, 0.74, 0.49, 0.78, 0.45 and 0.60, respectively. In this example, C-FS gives the best prediction, followed by DC-SIS and LASSO.
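A minimal sketch of this five-fold comparison; the polynomial-basis additive fit below is a crude stand-in for the mgcv smoothers used in the paper, and all names are our own.

```python
def cv_mspe(X_sel, y, n_folds=5, degree=3, rng=None):
    """Five-fold CV mean squared prediction error for an additive fit on the
    selected covariates, with a polynomial basis per covariate."""
    rng = rng or np.random.default_rng(0)
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    # additive design matrix: intercept plus powers 1..degree of each covariate
    B = np.column_stack([np.ones(n)] + [X_sel ** d for d in range(1, degree + 1)])
    errs = []
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        coef, *_ = np.linalg.lstsq(B[train], y[train], rcond=None)
        errs.append(np.mean((y[test] - B[test] @ coef) ** 2))
    return np.mean(errs)
```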

5. CONCLUDING REMARKS

In this article, we proposed a CD-based forward screening procedure, which is model-free and robust to the presence of outliers in the response. By using a stepwise searching framework, the proposed procedure incorporates joint correlations among features in the screening process and thus provides more reliable results in applications.

In general, this forward screening procedure shares a similar spirit with the iterative screening approaches (Zhu et al., 2011; Zhong and Zhu, 2015). However, how to decide model sizes for the iterative screening approaches and how to study their theoretical properties are rarely discussed in the existing literature. Equipped with our proposed model-free forward screening procedure, a data-driven method is proposed to determine which covariates should be retained. We also show that our proposed forward screening procedure possesses the desirable sure screening property.

A model-free and robust method is often computationally intensive. Our experience shows that the proposed C-FS procedure is more computationally demanding than the other procedures. This is possibly caused by the stepwise searching and adaptive thresholding, which make the proposed method more reliable and fully automatic. We conjecture that a simplified procedure may lead to less computational cost with a small sacrifice of numerical stability. In our simulation setup, each run takes no more than 4 minutes on average for n = 200 and p = 3,000 on a PC with an Intel Core 2 Duo T9600 2.8 GHz processor and 4 GB of RAM. This numerical cost is often acceptable in practice. It would be interesting to further develop a more efficient algorithm for the CD-based method.

Supplementary Material

Supplement

Acknowledgment

The authors thank the AE and the reviewers for their constructive comments, which have led to a significant improvement of the earlier version of this article. Liping Zhu is the corresponding author.

Funding

This work was supported by National Natural Science Foundation of China (NNSFC) grants 11731011, 11690014, 11690015 and 11801501, NSERC grant RGPIN-2016-05024, NSF grant DMS 1820702, NIDA, NIH grant P50 DA039838, the Ministry of Education Project of Key Research Institute of Humanities and Social Sciences at Universities (16JJD910002), and the National Youth Top-notch Talent Support Program, P. R. China. The content is solely the responsibility of the authors and does not necessarily represent the official views of NNSFC, MEC, NSF, NIH or NIDA.

APPENDIX: LEMMAS AND PROOFS OF THEOREMS

Appendix A: Some Lemmas

Lemma 1. (Bernstein's Inequality, van der Vaart and Wellner (1996, Lemma 2.2.11)) Let $X_1, X_2, \ldots, X_n$ be independent random variables with mean 0 and $E|X_i|^m \le m!M^{m-2}\upsilon_i/2$ for every $m \ge 2$ and $i = 1, 2, \ldots, n$, where $M$ and the $\upsilon_i$ are positive constants. Then

\[
\mathrm{pr}\{|X_1 + X_2 + \cdots + X_n| > \varepsilon\} \le 2\exp\Big\{-\frac{\varepsilon^2}{2(\upsilon + M\varepsilon)}\Big\}, \quad\text{for } \upsilon \ge \sum_{i=1}^{n}\upsilon_i.
\]

For notational clarity, we denote $s = |F|$ in what follows.

Lemma 2. If $s = o(n^{1/5})$ and conditions (B1)-(B5) hold true, then for any $k \in F^c$ and $\varepsilon_n = Cn^{-\kappa}$, where $C$ and $\kappa$ are positive constants,

\[
\mathrm{pr}\{(\widehat\beta_{k|F} - \beta_{0,k|F})^{\mathrm T}(\widehat\beta_{k|F} - \beta_{0,k|F}) > \varepsilon_n\} < 2s\exp(-c_1 n s^{-1}\varepsilon_n),
\]

where $c_1$ is a positive constant and $\beta_{0,k|F}$ is defined by $\arg\min_{\beta_{k|F}} E\{X_k - g_{k|F}(\mathbf{x}_F, \beta_{k|F})\}^2$.

Proof of Lemma 2: The proof is given in the supplementary document. □

Appendix B: Proof of Theorems

Proof of Theorem 2: We prove the first part. Under $H_0$, for any $y \in \mathbb{R}$, we have

\[
E\{\mathbf{1}(Y < y)E(X_k \mid \mathbf{x}_F)\} = E[E\{\mathbf{1}(Y < y) \mid \mathbf{x}_F\}X_k] = E[E\{\mathbf{1}(Y < y) \mid \mathbf{x}_F, X_k\}X_k] = E\{\mathbf{1}(Y < y)E(X_k \mid \mathbf{x}_F, X_k)\} = E\{\mathbf{1}(Y < y)X_k\}.
\]

This completes the proof of the first part. Next we prove the second part. Define $\xi(y) = \{\xi_1(y), \ldots, \xi_p(y)\}^{\mathrm T} \overset{\mathrm{def}}{=} E\{\partial F(y \mid \mathbf{x})/\partial\mathbf{x}\}$. Stein's lemma yields that $\xi(y) = \Sigma^{-1}E\{\mathbf{1}(Y < y)\mathbf{x}\}$. Assumption A1 ensures that, for any $k \in A$, there exists some $y$ such that $\xi_k(y) \neq 0$. Next, we show that $\max_{k \in F^c \cap A}\omega_{k|F} > 0$ holds for any $F$ satisfying $F^c \cap A \neq \emptyset$. Define $\Omega_{k|F}(y) \overset{\mathrm{def}}{=} E[\mathbf{1}(Y < y)\{X_k - E(X_k \mid \mathbf{x}_F)\}]$. The normality of $\mathbf{x}$ indicates $E(X_k \mid \mathbf{x}_F) = \beta_{k|F}^{\mathrm T}\mathbf{x}_F$, where $\beta_{k|F} = \Sigma_F^{-1}\Sigma_{F,k}$. Thus $\Omega_{k|F}(y) = (1, -\beta_{k|F}^{\mathrm T})[E\{\mathbf{1}(Y < y)X_k\}, E\{\mathbf{1}(Y < y)\mathbf{x}_F^{\mathrm T}\}]^{\mathrm T}$. Without loss of generality, we assume $(\mathbf{x}_F^{\mathrm T}, X_k)^{\mathrm T}$ to be the first $|F| + 1$ elements of $\mathbf{x}$. It follows that

\[
\Omega_{k|F}(y) = (-\beta_{k|F}^{\mathrm T}, 1)\,(\mathbf{I}_{|F|+1}, \mathbf{0}_{(|F|+1)\times(p-|F|-1)})\,\Sigma\,\xi(y) = (\Sigma_{k,S} - \Sigma_{k,F}\Sigma_F^{-1}\Sigma_{F,S})\,\xi(y),
\]

where $\Sigma_{F_1,F_2} = E(\mathbf{x}_{F_1}\mathbf{x}_{F_2}^{\mathrm T})$ for $F_1, F_2 \subseteq S$ and $S = \{1, \ldots, p\}$. Under model (3.1), $\Omega_{k|F}(y) = (\Sigma_{k,F^c\cap A} - \Sigma_{k,F}\Sigma_F^{-1}\Sigma_{F,F^c\cap A})\,\xi_{F^c\cap A}(y)$. This yields that

\[
\sum_{k \in F^c \cap A}\Omega_{k|F}^2(y) = \xi_{F^c\cap A}^{\mathrm T}(y)\,(\Sigma_{F^c\cap A,F^c\cap A} - \Sigma_{F^c\cap A,F}\Sigma_F^{-1}\Sigma_{F,F^c\cap A})^2\,\xi_{F^c\cap A}(y).
\]

Define

\[
\Sigma_{F\cup(F^c\cap A)} \overset{\mathrm{def}}{=} \begin{pmatrix}\Sigma_F & \Sigma_{F,F^c\cap A}\\ \Sigma_{F^c\cap A,F} & \Sigma_{F^c\cap A,F^c\cap A}\end{pmatrix}.
\]

Because $(\Sigma_{F^c\cap A,F^c\cap A} - \Sigma_{F^c\cap A,F}\Sigma_F^{-1}\Sigma_{F,F^c\cap A})^{-1}$ is a sub-matrix of $\Sigma_{F\cup(F^c\cap A)}^{-1}$, we have

\[
\rho_{\max}\{(\Sigma_{F^c\cap A,F^c\cap A} - \Sigma_{F^c\cap A,F}\Sigma_F^{-1}\Sigma_{F,F^c\cap A})^{-1}\} \le \rho_{\max}\{\Sigma_{F\cup(F^c\cap A)}^{-1}\}.
\]

Accordingly,

\[
\rho_{\min}(\Sigma_{F^c\cap A,F^c\cap A} - \Sigma_{F^c\cap A,F}\Sigma_F^{-1}\Sigma_{F,F^c\cap A}) \ge \rho_{\min}\{\Sigma_{F\cup(F^c\cap A)}\} \ge \rho_{\min}(\Sigma),
\]

where $\rho_{\min}(M)$ and $\rho_{\max}(M)$ represent the minimum and maximum eigenvalues of the matrix $M$, respectively. This leads to

\[
\max_{k \in F^c \cap A}\Omega_{k|F}^2(y) \ge |F^c\cap A|^{-1}\sum_{k \in F^c \cap A}\Omega_{k|F}^2(y) \ge |F^c\cap A|^{-1}\rho_{\min}^2(\Sigma_{F^c\cap A,F^c\cap A} - \Sigma_{F^c\cap A,F}\Sigma_F^{-1}\Sigma_{F,F^c\cap A})\,\xi_{F^c\cap A}^{\mathrm T}(y)\xi_{F^c\cap A}(y) \ge |F^c\cap A|^{-1}\rho_{\min}^2(\Sigma)\,\xi_{F^c\cap A}^{\mathrm T}(y)\xi_{F^c\cap A}(y) > 0,
\]

for some $y \in \mathbb{R}$. This completes the proof of the second part of Theorem 2. □

For notational clarity, we denote $\mu_{k|F} \overset{\mathrm{def}}{=} E(X_k \mid \mathbf{x}_F)$ in what follows.

Proof of Theorem 3: Define

\[
\zeta_{n,k|F}(y) \overset{\mathrm{def}}{=} n^{-1/2}\sum_{i=1}^{n}\mathbf{1}(Y_i < y)\{X_{ik} - g_{k|F}(\mathbf{x}_{iF}, \widehat\beta_{k|F})\}, \quad\text{for } y \in \mathbb{R}.
\]

By Taylor expansion, it follows that

\[
\zeta_{n,k|F}(y) = n^{-1/2}\sum_{i=1}^{n}\mathbf{1}(Y_i < y)\{X_{ik} - g_{k|F}(\mathbf{x}_{iF}, \beta_{0,k|F})\} - E\{\mathbf{1}(Y < y)g'_{k|F}(\mathbf{x}_F, \beta_{0,k|F})\}^{\mathrm T}\,n^{1/2}(\widehat\beta_{k|F} - \beta_{0,k|F}) + o_p(1).
\]

(S.1) in the supplement gives that

\[
\zeta_{n,k|F}(y) = n^{-1/2}\sum_{i=1}^{n}V_{k|F}(X_{ik}, \mathbf{x}_{iF}, Y_i; y) + o_p(1),
\]

where

\[
V_{k|F}(X_{ik}, \mathbf{x}_{iF}, Y_i; y) = \mathbf{1}(Y_i < y)\{X_{ik} - g_{k|F}(\mathbf{x}_{iF}, \beta_{0,k|F})\} - E\{\mathbf{1}(Y < y)g'_{k|F}(\mathbf{x}_F, \beta_{0,k|F})^{\mathrm T}\}\,\Sigma_{k|F}^{-1}\{X_{ik} - g_{k|F}(\mathbf{x}_{iF}, \beta_{0,k|F})\}\,g'_{k|F}(\mathbf{x}_{iF}, \beta_{0,k|F}),
\]

and $\Sigma_{k|F} = E[\{g'_{k|F}(\mathbf{x}_F, \beta_{0,k|F})\}\{g'_{k|F}(\mathbf{x}_F, \beta_{0,k|F})\}^{\mathrm T}]$.

Suppose $H_0$ in (3.2) holds true. Theorem 2 ensures that $\omega_{k|F} = 0$. Let $\zeta_{k|F}(\cdot)$ be a mean zero Gaussian process with covariance function $\mathrm{cov}\{\zeta_{k|F}(y_1), \zeta_{k|F}(y_2)\} = E\{V_{k|F}(X_k, \mathbf{x}_F, Y; y_1)V_{k|F}(X_k, \mathbf{x}_F, Y; y_2)\}$. It is easy to verify that $E\{\zeta_{n,k|F}(y)\} = o(1)$ and $E\{\zeta_{n,k|F}^2(y)\} = \mathrm{cov}\{\zeta_{k|F}(y), \zeta_{k|F}(y)\} + o(1)$. Thus $\zeta_{n,k|F}(\cdot) \to_d \zeta_{k|F}(\cdot)$, and consequently $\int\zeta_{n,k|F}^2(y)\,dF_n(y) \to_d \int\zeta_{k|F}^2(y)\,dF(y)$. This, together with the fact that $n\,\widehat{\mathrm{CCov}}\{(X_k - \mu_{k|F}) \mid Y\} = \int\zeta_{n,k|F}^2(y)\,dF_n(y)$, yields that $n\,\widehat{\mathrm{CCov}}\{(X_k - \mu_{k|F}) \mid Y\} \to_d \int\zeta_{k|F}^2(y)\,dF(y)$ (Kuo, 1975). Therefore,

\[
n\Big[E\Big\{\int\zeta_{k|F}^2(y)\,dF(y)\Big\}\Big]^{-1}\widehat{\mathrm{CCov}}\{(X_k - \mu_{k|F}) \mid Y\} \to_d \sum_{j=1}^{\infty}\lambda_{j,k|F}\chi_j^2(1).
\]

By Slutsky's theorem, it follows that $nK_{k|F}^{-1}\widehat\omega_{k|F} \to_d \sum_{j=1}^{\infty}\lambda_{j,k|F}\chi_j^2(1)$, where

\[
K_{k|F} \overset{\mathrm{def}}{=} E\Big[\mathbf{1}(Y < \tilde Y)\{X_k - g_{k|F}(\mathbf{x}_F, \beta_{0,k|F})\} - E\{\mathbf{1}(Y < \tilde Y)g'_{k|F}(\mathbf{x}_F, \beta_{0,k|F})\}^{\mathrm T}\Sigma_{k|F}^{-1}\{X_k - g_{k|F}(\mathbf{x}_F, \beta_{0,k|F})\}g'_{k|F}(\mathbf{x}_F, \beta_{0,k|F})\Big]^2 \Big/ \mathrm{var}\{X_k - g_{k|F}(\mathbf{x}_F, \beta_{0,k|F})\}. \tag{B.1}
\]

Suppose $H_1$ in (3.2) holds true. Recall that

\[
\begin{aligned}
\widehat{\mathrm{CCov}}\{(X_k - \mu_{k|F}) \mid Y\} &= \int\{n^{-1/2}\zeta_{n,k|F}(y)\}^2\,dF_n(y)\\
&= \int[n^{-1/2}\zeta_{n,k|F}(y) - E\{V_{k|F}(X_k, \mathbf{x}_F, Y; y)\} + E\{V_{k|F}(X_k, \mathbf{x}_F, Y; y)\}]^2\,dF_n(y)\\
&= 2\int E\{V_{k|F}(X_k, \mathbf{x}_F, Y; y)\}[n^{-1/2}\zeta_{n,k|F}(y) - E\{V_{k|F}(X_k, \mathbf{x}_F, Y; y)\}]\,dF_n(y)\\
&\qquad + \int E^2\{V_{k|F}(X_k, \mathbf{x}_F, Y; y)\}\,dF_n(y) + o_p(n^{-1/2})\\
&= 2\int E\{V_{k|F}(X_k, \mathbf{x}_F, Y; y)\}\,n^{-1/2}\zeta_{n,k|F}(y)\,dF(y) - 2\,\mathrm{CCov}\{(X_k - \mu_{k|F}) \mid Y\}\\
&\qquad + \int E^2\{V_{k|F}(X_k, \mathbf{x}_F, Y; y)\}\,dF_n(y) + o_p(n^{-1/2}).
\end{aligned}
\]

Thus

\[
\widehat{\mathrm{CCov}}\{(X_k - \mu_{k|F}) \mid Y\} - \mathrm{CCov}\{(X_k - \mu_{k|F}) \mid Y\} = n^{-1}\sum_{i=1}^{n}Z_{i,k|F} + o_p(n^{-1/2}),
\]

with

\[
Z_{i,k|F} \overset{\mathrm{def}}{=} 2\Big[\int E\{V_{k|F}(X_k, \mathbf{x}_F, Y; y)\}V_{k|F}(X_{ik}, \mathbf{x}_{iF}, Y_i; y)\,dF(y) - \mathrm{CCov}\{(X_k - \mu_{k|F}) \mid Y\}\Big] + E^2\{V_{k|F}(X_k, \mathbf{x}_F, Y; Y_i)\} - \mathrm{CCov}\{(X_k - \mu_{k|F}) \mid Y\}, \tag{B.2}
\]

where the expectation is taken with respect to $(X_k, \mathbf{x}_F, Y)$. By the central limit theorem, $n^{1/2}[\widehat{\mathrm{CCov}}\{(X_k - \mu_{k|F}) \mid Y\} - \mathrm{CCov}\{(X_k - \mu_{k|F}) \mid Y\}]$ converges in distribution to $N(0, \varsigma_{k|F}^2)$, where $\varsigma_{k|F}^2 \overset{\mathrm{def}}{=} \mathrm{var}(Z_{i,k|F})$. By Slutsky's theorem, we have $n^{1/2}(\widehat\omega_{k|F} - \omega_{k|F}) \to_d N(0, \Delta_{k|F})$, where

\[
\Delta_{k|F} \overset{\mathrm{def}}{=} \varsigma_{k|F}^2/\{\mathrm{var}(X_k - \mu_{k|F})\}^2. \tag{B.3}
\]

This completes the proof of Theorem 3. □

The uniform consistency of $\widehat\omega_{k|F}$ paves the way for proving Theorem 4.

Proposition 1. Under conditions (B1)-(B5), for any $\varepsilon_n > 0$, there exist positive constants $c_1, c_2, c_3, c_4$ and a sufficiently small $s_{\varepsilon_n} \in (0, 2/\varepsilon_n)$ such that

\[
\mathrm{pr}\Big\{\max_{k \in F^c}|\widehat\omega_{k|F} - \omega_{k|F}| > \varepsilon_n\Big\} \le O\big[p\exp\{n\log(1 - \varepsilon_n s_{\varepsilon_n}/2)/3\} + p\exp(-c_1 n\varepsilon_n^2) + pn\exp(-c_2 n\varepsilon_n^2) + ps\exp(-c_3 n s^{-2}\varepsilon_n^2) + ps\exp(-c_4 n s^{-2}\varepsilon_n)\big].
\]

Set $\varepsilon_n = Cn^{-\kappa}$ for some constants $C > 0$ and $\kappa > 0$. If there exists $\vartheta > 0$ such that $p = o\{\exp(an^\vartheta)\}$ for any $a > 0$ and $3/5 - 2\kappa - \vartheta > 0$, then

\[
\max_{F:\,|F| = o(n^{1/5})}\ \mathrm{pr}\Big\{\max_{k \in F^c}|\widehat\omega_{k|F} - \omega_{k|F}| > Cn^{-\kappa}\Big\} \to 0 \quad\text{as } n \to \infty.
\]

Proof of Proposition 1: The proof is given in the supplementary document. □

Proof of Theorem 4: For notational clarity, we define the random event $E_1$ = {there exists an index set $F$ such that $F^c \cap A \neq \emptyset$ and $\max_{k \in F^c \cap A}\widehat\omega_{k|F} \le \nu$}. For such an $F$, we have, by Assumption A3, $\max_{k \in F^c \cap A}\omega_{k|F} - \max_{k \in F^c \cap A}\widehat\omega_{k|F} > Cn^{-\varpi} - \nu \ge Cn^{-\varpi}/2$. Consequently,

\[
\max_{k \in F^c \cap A}(\omega_{k|F} - \widehat\omega_{k|F}) \ge \max_{k \in F^c \cap A}\omega_{k|F} - \max_{k \in F^c \cap A}\widehat\omega_{k|F} \ge Cn^{-\varpi}/2.
\]

Define the random event $E_2 = \{\max_{k \in F^c \cap A}(\omega_{k|F} - \widehat\omega_{k|F}) \ge Cn^{-\varpi}/2$ for some $F$ in $E_1\}$. The above discussion implies that $E_1 \subseteq E_2$. It follows that

\[
\mathrm{pr}\Big(\min_{F:\,F^c \cap A \neq \emptyset}\ \max_{k \in F^c \cap A}\widehat\omega_{k|F} > \nu\Big) = 1 - \mathrm{pr}(E_1) \ge 1 - \mathrm{pr}(E_2).
\]

Proposition 1 implies that pr(E2) → 0, which completes the proof of Theorem 4. □

REFERENCES

  1. Barut E, Fan J, and Verhasselt A (2016). "Conditional sure independence screening." Journal of the American Statistical Association, 111, 1266-1277.
  2. Bickel P and Levina E (2008). "Covariance regularization by thresholding." The Annals of Statistics, 36, 2577-2604.
  3. Buldygin VV and Kozachenko YV (1980). "Sub-Gaussian random variables." Ukrainian Mathematical Journal, 32, 483-489.
  4. Candes E and Tao T (2007). "The Dantzig selector: statistical estimation when p is much larger than n (with discussion)." The Annals of Statistics, 35, 2313-2404.
  5. Chang J, Tang CY, and Wu Y (2013). "Marginal empirical likelihood and sure independence feature screening." The Annals of Statistics, 41, 2123-2148.
  6. Chiang AP, Beck JS, Yen H-J, Tayeh MK, Scheetz TE, Swiderski R, Nishimura D, Braun TA, Kim K-Y, Huang J, Elbedour K, Carmi R, Slusarski DC, Casavant TL, Stone EM, and Sheffield VC (2006). "Homozygosity mapping with SNP arrays identifies a novel gene for Bardet-Biedl syndrome." Proceedings of the National Academy of Sciences, 103, 6287-6292.
  7. Fan J, Feng Y, and Song R (2011). "Nonparametric independence screening in sparse ultra-high dimensional additive models." Journal of the American Statistical Association, 106, 544-557.
  8. Fan J and Li R (2001). "Variable selection via nonconcave penalized likelihood and its oracle properties." Journal of the American Statistical Association, 96, 1348-1360.
  9. Fan J and Lv J (2008). "Sure independence screening for ultrahigh dimensional feature space (with discussion)." Journal of the Royal Statistical Society, Series B, 70, 849-911.
  10. Fan J and Peng H (2004). "Nonconcave penalized likelihood with a diverging number of parameters." The Annals of Statistics, 32, 928-961.
  11. Fan J, Samworth R, and Wu Y (2009). "Ultrahigh dimensional feature selection: beyond the linear model." Journal of Machine Learning Research, 10, 1829-1853.
  12. Fan J and Song R (2010). "Sure independence screening in generalized linear models with NP-dimensionality." The Annals of Statistics, 38, 3567-3604.
  13. He X, Wang L, and Hong H (2013). "Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data." The Annals of Statistics, 41, 342-369.
  14. Huang J, Ma SG, and Zhang CH (2008). "Adaptive lasso for sparse high-dimensional regression models." Statistica Sinica, 18, 1603-1618.
  15. Huber PJ (1973). "Robust regression: asymptotics, conjectures and Monte Carlo." The Annals of Statistics, 1, 799-821.
  16. Huber PJ and Ronchetti EM (2009). Robust Statistics, 2nd ed. Wiley: New York.
  17. Jennrich RI (1969). "Asymptotic properties of non-linear least squares estimators." The Annals of Mathematical Statistics, 40, 633-643.
  18. Kuo HH (1975). Gaussian Measures in Banach Spaces. Lecture Notes in Mathematics. Springer: Berlin.
  19. Li G, Peng H, Zhang J, and Zhu L (2012). "Robust rank correlation based screening." The Annals of Statistics, 40, 1846-1877.
  20. Li R, Zhong W, and Zhu L (2012). "Feature screening via distance correlation learning." Journal of the American Statistical Association, 107, 1129-1139.
  21. Ma S, Li R, and Tsai C-L (2017). "Variable screening via quantile partial correlation." Journal of the American Statistical Association, 112, 650-663.
  22. Mai Q and Zou H (2013). "The Kolmogorov filter for variable screening in high-dimensional binary classification." Biometrika, 100, 229-234.
  23. Mai Q, Zou H, and Yuan M (2012). "A direct approach to sparse discriminant analysis in ultra-high dimensions." Biometrika, 99, 29-42.
  24. Scheetz TE, Kim K-YA, Swiderski RE, Philp AR, Braun TA, Knudtson KL, Dorrance AM, DiBona GF, Huang J, Casavant TL, Sheffield VC, and Stone EM (2006). "Regulation of gene expression in the mammalian eye and its relevance to eye disease." Proceedings of the National Academy of Sciences, 103, 14429-14434.
  25. Shao X and Zhang J (2014). "Martingale difference correlation and its use in high-dimensional variable screening." Journal of the American Statistical Association, 109, 1302-1318.
  26. Song R, Yi F, and Zou H (2014). "On varying-coefficient independence screening for high-dimensional varying-coefficient models." Statistica Sinica, 24, 1735-1752.
  27. Székely GJ, Rizzo ML, and Bakirov NK (2007). "Measuring and testing dependence by correlation of distances." The Annals of Statistics, 35, 2769-2794.
  28. Székely GJ and Rizzo ML (2009). "Brownian distance covariance." The Annals of Applied Statistics, 3, 1236-1265.
  29. Tan F and Zhu L (2018). "Adaptive-to-model checking for regressions with diverging number of predictors." The Annals of Statistics, to appear.
  30. Tibshirani R (1996). "Regression shrinkage and selection via the lasso." Journal of the Royal Statistical Society, Series B, 58, 267-288.
  31. van der Vaart AW and Wellner JA (1996). Weak Convergence and Empirical Processes. Springer: New York.
  32. Wang H (2009). "Forward regression for ultra-high dimensional variable screening." Journal of the American Statistical Association, 104, 1512-1524.
  33. White H (1981). "Consequences and detection of misspecified nonlinear regression models." Journal of the American Statistical Association, 76, 419-433.
  34. Xu C and Chen J (2014). "The sparse MLE for ultrahigh-dimensional feature screening." Journal of the American Statistical Association, 109, 1257-1269.
  35. Zhong W and Zhu L (2015). "An iterative approach to distance correlation-based sure independence screening." Journal of Statistical Computation and Simulation, 85, 1-15.
  36. Zhu L, Li L, Li R, and Zhu L (2011). "Model-free feature screening for ultrahigh dimensional data." Journal of the American Statistical Association, 106, 1464-1475.
  37. Zou H (2006). "The adaptive lasso and its oracle properties." Journal of the American Statistical Association, 101, 1418-1429.
