Author manuscript; available in PMC: 2023 Jan 1.
Published in final edited form as: J Am Stat Assoc. 2021 Feb 10;117(539):1516–1529. doi: 10.1080/01621459.2020.1864380

Covariate Information Number for Feature Screening in Ultrahigh-Dimensional Supervised Problems

Debmalya Nandy 1,*, Francesca Chiaromonte 2,3, Runze Li 2
PMCID: PMC9512254  NIHMSID: NIHMS1670945  PMID: 36172297

Abstract

Contemporary high-throughput experimental and surveying techniques give rise to ultrahigh-dimensional supervised problems with sparse signals; that is, a limited number of observations (n), each with a very large number of covariates (p ≫ n), only a small share of which is truly associated with the response. In these settings, major concerns about computational burden, algorithmic stability, and statistical accuracy call for substantially reducing the feature space by eliminating redundant covariates before the use of any sophisticated statistical analysis. Along the lines of Sure Independence Screening (Fan and Lv, 2008) and other model- and correlation-based feature screening methods, we propose a model-free procedure called Covariate Information Number - Sure Independence Screening (CIS). CIS uses a marginal utility connected to the notion of the traditional Fisher Information, possesses the sure screening property, and is applicable to any type of response (features) with continuous features (response). Simulations and an application to transcriptomic data on rats reveal the comparative strengths of CIS over some popular feature screening methods.

Keywords: Ultrahigh dimension, Supervised problems, Sure independence screening, Model-free, Fisher information, Affymetrix GeneChip Rat Genome 230 2.0 Array

1. Introduction

Contemporary high-throughput experimental and surveying techniques employed in many scientific fields often generate data on an enormous number of variables. This is the case, for example, in the “Omics” sciences (Genomics, Transcriptomics, Proteomics, Metabolomics, etc.) and in biomedical applications involving imaging and/or the analysis of extensive electronic medical records. Extracting meaningful and interpretable information from these data often requires studying the association of a response with thousands to millions of predictors (p) – here and in the following, we use “feature”, “predictor”, and “covariate” interchangeably, all indicating a potential explanatory variable for the response. Even when these supervised problems have what appears to be a large sample size (n), this can in fact be orders of magnitude smaller than p. For instance, in transcriptomic studies n may be in the hundreds, but p may be in the thousands or tens of thousands. This is commonly referred to as an ultrahigh-dimensional setting, and often linked to the fact that p increases as a function of n (collecting more observations produces more features). Roughly speaking, one can have p ≍ exp{O(n^ξ)}, ξ > 0 (Fan and Lv, 2008). Importantly, in ultrahigh-dimensional settings, association signals are often sparse, that is, only a handful of predictors contribute to explaining variation in the response.

When p exceeds n, computing or inverting sample covariance matrices (Σ̂) to estimate dependencies among predictors becomes very inaccurate, numerically unstable, or downright infeasible (Ledoit and Wolf, 2004; Schäfer and Strimmer, 2005; Bickel and Levina, 2008; Chen et al., 2015). This, in turn, negatively affects regression model fitting, classification methods, and also techniques for supervised dimension reduction – in their standard versions, most of these tools employ some version of Σ̂ or Σ̂⁻¹.

Popular methods such as LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001), Elastic Net (Zou and Hastie, 2005), and MCP (Zhang, 2010) use penalties to regularize supervised problems, performing feature selection and model fitting simultaneously (see Fan and Lv (2010) for a comprehensive overview). In practice, many of these methods may successfully handle p > n scenarios, but they also deteriorate and might not scale up in realistic time when p ≫ n (see for instance Fan and Lv (2008), Table 1, p. 862). Fan et al. (2011) further discuss the challenges of high and ultrahigh dimension in classification. “Curse of dimensionality” issues including computational burden, statistical inaccuracy, and algorithmic instability call for alternative approaches in tackling ultrahigh-dimensional supervised problems (Fan et al., 2009). One such approach is feature screening, pioneered by the development of Sure Independence Screening (SIS; Fan and Lv (2008)). Extensively studied ever since, feature screening has led to model-based, model-free, correlation-based, and distance-based procedures for a variety of supervised problems including regression, classification, discriminant analysis, and survival analysis. Liu et al. (2015) provide a comprehensive overview of work prior to 2015 (we also provide a selected list of references in Section S5 of the Supplement).

Table 1.

Total computation time (seconds; median over 100 runs) required for computing p = 2000 marginal utilities (Model(3)(a) with Σ_X = Σ_X^{(I)}, σ = 0.5, and n = 200) for different screening procedures. Number of slices (L) for CIS varies as indicated in the table. HOLP was run in R (R Core Team (2020); version 4.0.2) and the rest in MATLAB (MATLAB (2020); version 9.9.0.1467703 (R2020b)).

Method    SIS    HOLP   SIRS   DC-SIS  MDC-SIS  CIS[3]  CIS[5]  CIS[8]  CIS[10]  CIS[12]
Time (s)  0.082  0.074  0.414  1.222   0.741    3.338   3.054   2.966   2.980    3.010

In this article, we propose Covariate Information Number - Sure Independence Screening (CIS), a model-free feature screening procedure based on a novel marginal utility called Covariate Information Number (CIN). For each covariate Xj, the CIN captures marginal association with the response Y without assuming any specific underlying model, and can be interpreted in terms of the traditional Fisher information in Statistics. CIN is computed from the joint density of (Y, Xj) employing kernels to estimate marginal and inverse conditional densities within response sub-populations that are naturally defined when Y is categorical or discrete, and artificially generated by slicing the range if Y is continuous. Thus, our approach can be employed irrespective of the nature of the response and, in fact for continuous Y, it is robust to outliers (it uses only the ranks of the Y values for slicing). Moreover, switching the roles of Y and Xj in computing the CIN, our approach can be utilized also to screen discrete or categorical covariates – as long as the response is continuous. Because of the way information on marginal and conditional densities is used in the CIN calculation, in addition to being model-free, our approach does not require strong assumptions on the predictors (the rationale is similar to that discussed in Yao et al. (2019)). Under mild regularity conditions, we show that the CIS procedure built upon the CIN marginal utility possesses the sure screening property (Fan and Lv, 2008). Overall, we find that CIS is competitive with, and sometimes better than, other popular feature screening procedures.

We compare CIS to five other procedures: (a) Sure Independence Screening (SIS; Fan and Lv (2008)), (b) High dimensional Ordinary Least squares Projection (HOLP; Wang and Leng (2016)), (c) Sure Independent Ranking and Screening (SIRS; Zhu et al. (2011)), (d) Distance Correlation - Sure Independence Screening (DC-SIS; Li et al. (2012)), and (e) Martingale Difference Correlation - Sure Independence Screening (MDC-SIS; Shao and Zhang (2014)). SIS postulates a naive linear regression model, and screens the predictors based on the magnitudes of their Pearson Correlations with the response. It is intuitive, computationally straightforward, and possesses the sure screening property. SIRS, DC-SIS, and MDC-SIS are among the most popular model-free feature screening procedures in the literature. Specifically, SIRS does not postulate a specific underlying model but relies instead on a general framework which includes many common parametric and semiparametric models. It possesses rank consistency, a stronger property than sure screening, and allows both univariate and multivariate responses. Moreover, similar to our CIS, it only considers response ranks and is therefore robust to outliers. DC-SIS screens the predictors based on their Distance Correlations (Székely et al., 2007) with the response – a measure of departure from independence for two random vectors, built through characteristic functions. It possesses the sure screening property, allows both univariate and multivariate responses, and can also handle grouped predictors. MDC-SIS uses Martingale Difference Correlations, which are a natural extension of Distance Correlations. It screens predictors that contribute to the conditional mean of Y | X, and has the sure screening property. Notably, to tackle regressions with heteroscedastic errors, the authors of MDC-SIS proposed an extension that screens based on contributions to conditional quantiles. 
While the feature screening procedures discussed above involve computing the marginal utilities independently, HOLP involves a joint estimation of these measures. HOLP is motivated by the ordinary least squares estimator and ridge regression, straightforward and efficient to compute, and relaxes the often violated assumption of strong marginal correlations between each truly associated covariate and the response. However, similar to SIS, HOLP also postulates a naive linear regression model. All procedures in (a)–(e), as well as CIS, allow p to grow exponentially with n.

The rest of the article is organized as follows. Section 2 contains the details of our proposal – formulation, properties, and implementation of the CIN; the CIS algorithm; and the theoretical properties of CIS under appropriate assumptions. Section 3 presents an extensive simulation study to compare the performance of CIS to those of the five popular screening procedures mentioned above. Section 4 presents an application to transcriptomic data on Norway rats (GeneChip Rat Genome 230 2.0 Array Data; (Scheetz et al., 2006)). Concluding remarks are provided in Section 5, whereas proofs of theoretical results, full simulation results, details on the transcriptomic data application, and some relevant additional information are provided in an online Supplement (an “S” in the numbered references below indicates Sections, Tables, and Figures in the Supplement).

2. CIN – Sure Independence Screening (CIS)

In this section we describe our proposal. We introduce the general setup, provide details on the formulation, properties, and implementation of our Covariate Information Number (CIN) marginal utility, describe our CIN-based screening algorithm (CIS) and discuss its theoretical properties under appropriate assumptions. Notation is similar to that in Zhu et al. (2011).

Consider a univariate response Y with support Θ_Y and a p-dimensional covariate vector X = (X_1, …, X_p)ᵀ with p ≫ n. Let F(y | x) = P(Y ≤ y | X = x) denote the conditional distribution function of Y given X = x in the following definitions of the two index sets:

$$\mathcal{A} = \{\, j : F(y \mid x) \text{ is functionally dependent on } X_j \text{ for some } y \in \Theta_Y \,\},$$
$$\mathcal{I} = \{\, j : F(y \mid x) \text{ is not functionally dependent on } X_j \text{ for any } y \in \Theta_Y \,\} = \{1, 2, \ldots, p\} \setminus \mathcal{A}.$$

A indexes the predictors that are truly associated with the response; it is called the active set. I indexes the remaining, inactive predictors. Note that, in this definition, F(y | x) is completely generic: no model form is specified. Let s = |A| denote the cardinality of A; s out of p covariates are active, and therefore s measures the sparsity level of the association between Y and X. In any feature screening procedure, including our CIS, the objective is to estimate A conservatively; that is, with a minimal prevalence of false negatives.

2.1. Covariate Information Number (CIN)

Next, we expand upon the CIN, the marginal utility for our proposed CIS. Some of the developments follow directly as special cases (p = 1) of those for the Covariate Information Matrix (Yao et al., 2019).

Let f(y | xj) and fj (xj) respectively denote the conditional density of Y given Xj = xj and the marginal density of Xj, and assume that the standard regularity conditions for likelihood analysis hold (see Section 2.6). Treating the observed xj as a “parameter”, we can use the traditional Fisher information formulation to create the quantity

$$\mathcal{F}_{x_j} = \int \Big[\frac{\partial}{\partial x_j} \log f(y \mid x_j)\Big]^2 f(y \mid x_j)\, dy.$$

Since it captures the information that f(y | x_j) would carry about x_j if x_j were in fact unknown, this quantity provides a local measure of association. Based on the bivariate joint distribution of (Y, X_j), the Covariate Information Number (CIN) for covariate X_j is then defined as the expected value of this quantity; that is

$$\omega_j = \int \mathcal{F}_{x_j}\, f_j(x_j)\, dx_j. \qquad (1)$$

Estimates of the scalars ω_j, j = 1, 2, …, p, which are theoretically non-negative by definition, are key components of the final form of the marginal utilities (see below) that we use for ranking the covariates in our CIS screening procedure.

We next introduce two other quantities which are relevant to our proposal. Here we need to also consider fj(xj | y) and f(y), the inverse conditional density of Xj given Y = y and the marginal density of Y, again with standard regularity conditions (see Section 2.6). The density information (Hui and Lindsay, 2010; Lindsay and Yao, 2012; Yao et al., 2019) in the marginal density of Xj is defined as

$$J_{X_j} = \int \Big[\frac{\partial}{\partial x_j} \log f_j(x_j)\Big]^2 f_j(x_j)\, dx_j. \qquad (2)$$

Similarly, the density information in the conditional density of Xj |Y = y is defined as

$$J_{X_j \mid Y=y} = \int \Big[\frac{\partial}{\partial x_j} \log f_j(x_j \mid y)\Big]^2 f_j(x_j \mid y)\, dx_j,$$

and can be averaged to produce

$$J_{X_j \mid Y} = \int J_{X_j \mid Y=y}\, f(y)\, dy. \qquad (3)$$

The marginal utility which CIS uses for each covariate X_j, j = 1, 2, …, p, is the CIN (ω_j) normalized by its density information, i.e. ω_j* = ω_j / J_{X_j}. The next theorem summarizes some properties of ω_j* (proofs are in Section S1).

Theorem 2.1

(Properties of the normalized CIN).

  1. The normalized CIN ω_j* = 0 if and only if Y and X_j are statistically independent.

  2. If (Y, X_j) follows a bivariate normal distribution with correlation coefficient ρ_j, then the normalized CIN ω_j* is a monotonically increasing function of |ρ_j|.

  3. ω_j can be expressed as ω_j = J_{X_j|Y} − J_{X_j}. Hence, the normalized CIN ω_j* is
$$\omega_j^* = \frac{J_{X_j \mid Y} - J_{X_j}}{J_{X_j}} = \frac{J_{X_j \mid Y}}{J_{X_j}} - 1. \qquad (4)$$
  4. If a ≠ 0 and b are two constants and X̃_j = aX_j + b, then the normalized CIN of X̃_j is ω̃_j* = ω_j*, the normalized CIN of X_j.

Henceforth, we will ignore the subtraction of 1 in (4) (which has no effect on the ranking of the X_j's): we will consider J_{X_j|Y}/J_{X_j} and simply refer to it as the CIN. Properties (i) and (ii) motivate its use as a marginal utility in feature screening: positive values of ω_j* correspond to statistical dependence between Y and X_j and, in a bivariate Gaussian scenario where the association is linear, ω_j* increases with the absolute value of the correlation coefficient. Notably, this implies that ω_j* is a more general measure of association than the marginal utility employed by SIS (Fan and Lv, 2008), capturing also potential non-linear dependencies between Y and X_j. (iii) reformulates the CIN as the ratio of the average density information in f_j(x_j | y) (inverse regression) to the density information in f_j(x_j), the marginal density of the predictor X_j. Following the argument in Yao et al. (2019), the ratio in Equation (4) “cleanses” the association signal from potential distributional peculiarities of X_j, and renders ω_j* effective also for “not well-behaved” covariates. Finally, (iv) describes the effect of affine transformations on ω_j*.
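Property (ii) can be verified directly in the Gaussian case; the following is a sketch of the calculation (the formal proof is in Section S1), using only standard facts about the bivariate normal:

```latex
% If (Y, X_j) is bivariate normal with correlation \rho_j, the inverse
% conditional X_j \mid Y = y is normal with variance \sigma_{X_j}^2 (1 - \rho_j^2)
% for every y, and the density information of a N(\mu, \sigma^2) law is 1/\sigma^2:
J_{X_j} = \frac{1}{\sigma_{X_j}^2}, \qquad
J_{X_j \mid Y = y} = \frac{1}{\sigma_{X_j}^2 (1 - \rho_j^2)} \ \text{ for all } y,
\qquad \text{so} \qquad
\omega_j^* = \frac{J_{X_j \mid Y}}{J_{X_j}} - 1 = \frac{\rho_j^2}{1 - \rho_j^2},
```

which is strictly increasing in |ρ_j| and equals 0 exactly when ρ_j = 0, consistent with properties (i) and (ii).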

Note that (3) can be easily adapted to a discrete or categorical response by simply replacing the integral with an appropriate sum. If Y ∈ {y^{(1)}, …, y^{(L)}} with Pr(Y = y^{(l)}) = π_l, l = 1, …, L, one has

$$J_{X_j \mid Y} = \sum_{l=1}^{L} \pi_l\, J_{X_j \mid Y=y^{(l)}}. \qquad (5)$$

The ratio of (5) to (2) provides a straightforward definition of the CIN in discrete regressions or classification problems.

2.2. Estimation of the CIN

Up to this point, we have defined and characterized the CIN theoretically, at the population level. Next, we describe its estimation for the practical implementation of our CIS procedure on sample data.

Three facts are key. First, we write the CIN through (4), which comprises two quantities, J_{X_j|Y} and J_{X_j}. Second, regardless of the nature of the response, we write J_{X_j|Y} through its formulation in (5); if the response is continuous, we create an approximate version with “sub-populations” defined by slicing the range of Y. Notably, this slicing strategy is used by most sufficient dimension reduction methods based on inverse regression, for example, SIR (Li, 1991), SAVE (Cook and Weisberg, 1991), SR (Wang and Xia, 2008), and CIM (Yao et al., 2019). Third, we estimate J_{X_j} and the components J_{X_j|Y=y^{(l)}}, l = 1, 2, …, L, of J_{X_j|Y} through kernels.

Let us drop the predictor index j to simplify notation (X now stands for a generic predictor) and start with the estimation of J_X. (2) expresses J_X as an expectation: J_X = E_X[(∂/∂X) log f(X)]² = E_X[g(X)]. We therefore need to estimate the density f(·) and the expectation of the function g(x) = [(∂/∂x) log f(x)]². We use a kernel density estimator

$$f_n(x; h) = \frac{1}{n} \sum_{i=1}^{n} k_h(x - x_i), \quad \text{with } k_h(t) = \frac{1}{h}\, k\Big(\frac{t}{h}\Big), \qquad (6)$$

and replace the theoretical expectation by the sample average, which gives Ĵ_X = Ê_X[ĝ(X)] = (1/n) Σ_{i=1}^{n} ĝ(x_i). Next, using observations within slices (if Y is continuous) or natural sub-populations (if Y is discrete or categorical), we produce each Ĵ_{X|Y=y^{(l)}}, l = 1, 2, …, L, in exactly the same way, and set π̂_l = n_l/n, l = 1, 2, …, L (where n_l is the number of observations with Y = y^{(l)}). Thus, we estimate

$$\hat{J}_{X \mid Y} = \sum_{l=1}^{L} \hat{\pi}_l\, \hat{J}_{X \mid Y=y^{(l)}}. \qquad (7)$$

Finally, we compute a ratio to produce ω̂* = Ĵ_{X|Y} / Ĵ_X.

An important remark is in order for the case of a continuous response: slices are customarily produced so as to contain (approximately) the same number of observations. Thus, the partitioning does not use the observed values y_i, i = 1, 2, …, n, but rather their ranks. Consequently, similar to SIRS (Zhu et al., 2011), our CIS screening is robust to outliers in the response (see Model(4) in Section 3.2).
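As a concrete illustration of this estimation recipe, a minimal Python sketch for a single covariate might look as follows. This is not the authors' MATLAB implementation; the function names are ours, and the Gaussian kernel with Silverman's rule-of-thumb bandwidth anticipates the choices described in Section 2.3:

```python
import numpy as np

def density_info(x):
    """Estimate J_X = E_X[((d/dx) log f(X))^2] with a Gaussian kernel density
    estimate, using Silverman's bandwidth h = 1.06 * sd * n^(-1/5)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    h = 1.06 * np.std(x, ddof=1) * n ** (-1 / 5)
    d = (x[:, None] - x[None, :]) / h                 # pairwise (x_i - x_m)/h
    k = np.exp(-0.5 * d ** 2) / np.sqrt(2 * np.pi)    # Gaussian kernel values
    f = k.sum(axis=1) / (n * h)                       # f_n(x_i; h), as in (6)
    fprime = (-d * k).sum(axis=1) / (n * h ** 2)      # analytic derivative of f_n
    return np.mean((fprime / f) ** 2)                 # (1/n) sum of g_hat(x_i)

def normalized_cin(x, y, L=5):
    """omega_hat* = J_hat_{X|Y} / J_hat_X for one covariate, slicing a
    continuous y into L approximately equal-count slices by its ranks."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = y.size
    slices = np.array_split(np.argsort(y), L)         # rank-based slicing
    j_xy = sum((s.size / n) * density_info(x[s]) for s in slices)  # as in (7)
    return j_xy / density_info(x)
```

For an active covariate the within-slice densities are more concentrated than the marginal one, inflating the ratio above its independence baseline; for an inactive covariate the ratio stays near 1.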

2.3. Tuning parameters

Following the discussion of (5) and in Section 2.2, when the response is discrete or categorical, the number of “slices” L is given. However, when the response is continuous, the choice of L is critical. This is a well-recognized challenge in inverse regression-based Sufficient Dimension Reduction methods (see Wang and Xia (2008); Yao et al. (2019)). There is a trade-off between using more slices to achieve a more accurate approximation of the overall object of interest (in our case, J_{X|Y}) and using fewer slices to have a sufficiently large number of observations for in-slice calculations (in our case, the estimation of each J_{X|Y=y^{(l)}}). For the simulations in Section 3, we used total sample sizes n = 200 and 600 and investigated the CIS screening performance for L = 2, 3, 5, 8, 10, and 12. CIS performance does vary with L, because we need a large enough sample size within each slice for kernel density estimation to be reliable. We find that moderate values (e.g., L = 3–8) work well in most cases, and that the effect of L becomes negligible as the total n becomes larger and/or the signal-to-noise ratio in the data becomes stronger (see Sections 3 and S3).

Another critical choice in our estimation is that of the kernel and bandwidth. In our implementation, we use a simple Gaussian kernel for k_h(·) (k_h^{(l)}(·) for the sub-population with Y = y^{(l)}) and Silverman’s rule of thumb for the bandwidth, which sets h = 1.06 × σ̂_x n^{−1/5} (h^{(l)} = 1.06 × σ̂_x^{(l)} n_l^{−1/5}; see Silverman (2018)), where σ̂_x (σ̂_x^{(l)}) is the sample standard deviation of X (of X^{(l)}, the X-observations with Y = y^{(l)}).

2.4. Computational burden

Computational burden is an important consideration for any feature screening procedure. In addition to the number of covariates to be screened (p), it depends on the time needed to calculate each marginal utility. Our CIN marginal utility (ω̂_j*) has a reasonable computational cost, making CIS viable also in applications with a large number of covariates (see Section 4), and comparable to other model-free screens. For instance, we performed a comparative study of the elapsed computation times for CIS and five additional screening procedures (see Sections 1, 3.2, and 4) for a simulation scenario with p = 2000 and n = 200 (this used Model(3)(a) with σ = 0.5 and Σ_X = Σ_X^{(I)}; see Section 3.2). In this comparison, all methods except HOLP were implemented in MATLAB (MATLAB, 2020), version 9.9.0.1467703 (R2020b). HOLP was implemented in R (R Core Team, 2020), version 4.0.2, using the GitHub R package screening available at https://github.com/wwrechard/screening (see Section S6 for more details). All code was run on a MacBook Pro 2019 laptop with macOS Mojave Version 10.14.6, a 2.3GHz Intel Core i9 processor, and 16GB 2400MHz DDR4 RAM.

Taking medians over 100 simulation runs, CIS with L = 5 slices took ≈ 3.054 seconds to compute all p = 2000 marginal utilities (see also Table 1). This was higher than, but comparable to, the widely used model-free DC-SIS, which took ≈ 1.222 seconds. As is to be expected given the much simpler nature of their marginal utilities, SIS and HOLP were much faster, taking respectively ≈ 0.082 and ≈ 0.074 seconds. SIRS and MDC-SIS, which are also model-free, took respectively ≈ 0.414 and ≈ 0.741 seconds (see Table S25 for run times with n = 600). Note that, except for HOLP, since the utility of each covariate is computed marginally, total computation time scales linearly with p. Extrapolating from the calculations above, in an application with n = 200 and as many as p = 2,000,000 covariates, CIS would compute all marginal utilities in ≈ 50 minutes. Of course, computation time could be vastly reduced by implementing screens in more efficient programming languages, such as C (Ritchie, Kernighan and Lesk, 1988).
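Since every utility except HOLP's is computed one covariate at a time, the extrapolation above is simple linear-in-p arithmetic; a quick illustrative check using the reported median timing:

```python
# Linear-in-p scaling of a marginal screen: extrapolate the reported median
# CIS timing (L = 5, p = 2000) to p = 2,000,000 covariates.
t_per_covariate = 3.054 / 2000          # seconds per marginal utility
t_total = t_per_covariate * 2_000_000   # total seconds at p = 2,000,000
print(t_total / 60)                     # about 51 minutes
```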

2.5. CIS algorithm

Let (y_i, x_i), i = 1, 2, …, n, be a random sample from the distribution of (Y, X). To implement CIS in practice, we proceed as follows:

  1. For j = 1, 2, …, p, compute Ĵ_{X_j|Y}, Ĵ_{X_j}, and ω̂_j* = Ĵ_{X_j|Y}/Ĵ_{X_j} (Section 2.2).

  2. Order ω̂_{(1)}* ≥ ω̂_{(2)}* ≥ ⋯ ≥ ω̂_{(p)}* and estimate Â_d as the set of the top d ranking covariates.
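The two steps can be sketched compactly in Python (again, not the authors' implementation). For brevity, this sketch replaces the kernel-based ω̂_j* with a normal-reference proxy: for Gaussian data the density information of a N(μ, σ²) law is 1/σ², so the ratio Ĵ_{X_j|Y}/Ĵ_{X_j} is approximated by Var(X_j) × Σ_l π̂_l / Var(X_j | slice l). The function name and this proxy are our own simplifications:

```python
import numpy as np

def cis_screen(X, y, d, L=5):
    """CIS sketch: (1) compute a marginal utility for each covariate using
    rank-based, equal-count slices of y; (2) rank the utilities in decreasing
    order and keep the indices of the top d covariates.

    The utility is a normal-reference stand-in for the kernel-based
    normalized CIN (density information of a Gaussian = 1/variance)."""
    n, p = X.shape
    slices = np.array_split(np.argsort(y), L)        # slice y by its ranks
    util = np.empty(p)
    for j in range(p):
        xj = X[:, j]
        j_xy = sum((s.size / n) / np.var(xj[s]) for s in slices)
        util[j] = np.var(xj) * j_xy                  # proxy for J_{X|Y}/J_X
    keep = np.argsort(-util)[:d]                     # top-d ranking covariates
    return keep, util
```

With a hard threshold d = n/log(n) (k = 1), active covariates whose within-slice variance shrinks relative to their marginal variance rank near the top and are retained.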

The normalization in step 1 ensures that all marginal utilities are on the same density information scale. Moreover, by Theorem 2.1(iv), the ratio expressing ω̂_j* is invariant to affine transformations of X_j, e.g. marginal centering and/or scaling to unit variance.

The calculations in step 1 involve the tuning parameters L (the number of slices, if the response is continuous) and h (the bandwidth used in kernel density estimation) discussed in Section 2.3. Notably, step 2 involves another crucial quantity that plays a role possibly much more vital than those of the tuning parameters in the marginal utility calculation: the number d of covariates retained in the screen. Following the literature (for instance, Fan and Lv (2008); Li et al. (2012); Shao and Zhang (2014)), in our simulations we employ a hard threshold defined as d = k × n/log(n) with constant multiplier k = 1, 2, or 3 (see Sections 3 and S3). However, this is an intriguing open question for screening algorithms. For instance, Zhu et al. (2011) show that one can use a hard threshold, a soft threshold, or a combination of both. How to select d in an effective and data-driven fashion is beyond the scope of this article, but we hint at the development of a potential diagnostic in Section 5.

2.6. Sure screening property of CIS

In this section, we establish the sure screening property (Fan and Lv, 2008) for our CIS, built upon the CIN marginal utility. At the outset, we adjust the definition of the estimated active set as follows:

$$\hat{\mathcal{A}} = \{\, j : \hat{\omega}_j^* \geq c_0 n^{-\kappa},\ 1 \leq j \leq p \,\}, \qquad (8)$$

where c_0 > 0 and 0 < κ < γ < 1/3 are given constants (see below). In proving the sure screening property, we consider a discrete or categorical Y with L distinct values or labels. Recall that if Y is continuous, we operate with its “discretized” version obtained through slicing (see Sections 2.2 and 2.3).

Assumptions and regularity conditions

In proving sure screening, we assume that all active covariates X_j, j ∈ A, satisfy a minimum signal strength condition, and specifically that:

$$\min_{j \in \mathcal{A}} \omega_j^* \geq 2 c_0 n^{-\kappa}, \qquad (9)$$

where c_0 > 0 and 0 < κ < γ < 1/3 are the same constants that appear in (8). Note that here the minimum value for the signals (the true marginal utilities) is twice c_0 n^{−κ}. As pointed out in Liu et al. (2014), this assumption bounds the marginal utilities of active covariates away from 0 for any finite n. However, as n increases, this minimum signal strength can decrease, converging to 0 asymptotically. This indicates that, when n is very large, our procedure, which possesses the sure screening property, can retain covariates whose marginal association with Y is vanishingly weak but which are jointly associated with the response. This assumption corresponds to Condition 3 in Fan and Lv (2008) and is commonly used in the screening literature (see for instance Li et al. (2012); Shao and Zhang (2014)). Also note that, while (9) states the assumption on the minimal signal strength of the active covariates, we do not assume any condition on the order of the maximum signal strength to establish the sure screening property of CIS.

We also assume that the number of Y sub-populations (or slices) L is finite, and impose some regularity conditions on (i) the kernel densities and associated bandwidths k(·), h, k_h^{(l)}(·), h^{(l)}, l = 1, …, L, used for estimating each CIN (see (6) and (7)); (ii) the marginal and inverse conditional covariate densities f_j(·) and f_j(· | Y = y^{(l)}), l = 1, 2, …, L; and (iii) the marginal and inverse density informations J_X and J_{X|Y=y^{(l)}}, l = 1, 2, …, L. These regularity conditions are described below, again neglecting the covariate subscript j for notational simplicity.

Kernel densities:

C1. k_h(·) and k_h^{(l)}(·), l = 1, 2, …, L, have bounded variance.

C2. h = O(n^{−γ}), 0 < κ < γ < 1/3; and for each l = 1, 2, …, L, h^{(l)} = O(n_l^{−γ}), where n_l is the number of observations with Y = y^{(l)}.

C3. k_h(·) and k_h^{(l)}(·), l = 1, 2, …, L, are order-1 kernels with non-vanishing first derivatives (see Section S1.5).

C4. sup_{x∈χ} k_h(x) and sup_{x∈χ} k_h^{(l)}(x), l = 1, 2, …, L, are bounded above.

C5. The β-th moments (1 ≤ β < 2) of the absolute values of the kernel densities k_h(·) and k_h^{(l)}(·), l = 1, 2, …, L, are finite.

Covariate densities:

C6. f²(·) and f²(· | Y = y^{(l)}), l = 1, 2, …, L, are uniformly bounded away from zero.

C7. f(·) and f(· | Y = y^{(l)}), l = 1, 2, …, L, belong to the Hölder class Σ(β, Λ), where 1 ≤ β < 2 and Λ > 0 are constants (see Section S1.5).

C8. sup_{x∈χ} f(x) and sup_{x∈χ} f(x | Y = y^{(l)}), l = 1, 2, …, L, are bounded above.

C9. sup_{x∈χ} f′(x) and sup_{x∈χ} f′(x | Y = y^{(l)}), l = 1, 2, …, L, are bounded above.

C10. sup_{x∈χ} (f′(x))²/f²(x) and sup_{x∈χ} (f′(x | Y = y^{(l)}))²/f²(x | Y = y^{(l)}), l = 1, 2, …, L, are bounded above.

Density informations:

C11. min_{1≤l≤L} J_{X|Y=y^{(l)}}, and hence J_{X|Y}, are uniformly bounded away from zero.

C12. J_X and max_{1≤l≤L} J_{X|Y=y^{(l)}}, and hence J_{X|Y}, are uniformly bounded above (from C10).

C13. Ĵ_X and Ĵ_{X|Y} are bounded away from zero.

In the context defined by the assumptions and conditions above, we have the following theorem (additional details and proofs are provided in Sections S1.6 - S1.8).

Theorem 2.2

(Sure screening property for CIS). Let ξ_0* and c_0 be positive constants, let κ and γ be constants such that 0 < κ < γ < 1/3, and let n_{(1)} = min_{1≤l≤L} n_l be the size of the least numerous class (or slice). For j = 1, 2, …, p, we have

$$P\Big(\max_{1 \leq j \leq p} \big|\hat{\omega}_j^* - \omega_j^*\big| > c_0 n^{-\kappa}\Big) \leq O\Big(n\, p\, \exp\big(-\xi_0^*\, n^{-\kappa}\, n_{(1)}^{\gamma}\big)\Big),$$

where ω_j* and ω̂_j* are the true and the estimated CIN marginal utilities, respectively. Moreover, using the definition of Â in (8) and assuming the minimum signal strength condition in (9), we have

$$P\big(\mathcal{A} \subseteq \hat{\mathcal{A}}\big) \geq 1 - O\Big(s_n\, n\, \exp\big(-\xi_0^*\, n^{-\kappa}\, n_{(1)}^{\gamma}\big)\Big), \qquad (10)$$

where s_n is the cardinality of the active set A.

Note that the cardinality of A in (10), s_n, is indexed so as to indicate dependence on n: CIS guarantees sure screening when the number of covariates (p) as well as the number of active covariates grow as we gather more observations (see also Li et al. (2012); Shao and Zhang (2014); Liu et al. (2014)). Concerning the way p grows with the sample size, the exponent in (10) shows that CIS guarantees sure screening also for a non-polynomial log(p) = o(n^{−κ} n_{(1)}^{γ}) = o(π̂_{(1)}^{κ} n_{(1)}^{γ−κ}), where π̂_{(1)} = n_{(1)}/n is the smallest class proportion. Recall that, when Y is continuous, π̂_{(1)} ≈ 1/L since we create slices containing approximately equal numbers of observations (see Section 2.2). Finally, we note that, similar to other model-free approaches such as DC-SIS (Li et al., 2012), CIS guarantees sure screening under much more generic conditions compared to SIS (Fan and Lv, 2008); in particular, CIS does not require a linear regression function of Y on X.

3. Simulation study

In this section, we present simulation results on the performance of CIS in comparison to those of Sure Independence Screening (SIS; Fan and Lv (2008)), High dimensional Ordinary Least squares Projection (HOLP; Wang and Leng (2016)), Sure Independent Ranking and Screening (SIRS; Zhu et al. (2011)), Distance Correlation - Sure Independence Screening (DC-SIS; Li et al. (2012)), and Martingale Difference Correlation - Sure Independence Screening (MDC-SIS; Shao and Zhang (2014)). Some of the simulation scenarios are adapted from Zhu et al. (2011), Cui et al. (2015), and Chen et al. (2018).

3.1. Summary statistics to assess screening performance

Because screening procedures are used as a preliminary step, followed by modeling and fitting efforts in which predictors are further assessed, their main priority is sensitivity: as we separate active from inactive predictors, we want to minimize false negatives, that is, cases in which j ∈ A but j ∉ Â. Thus, to measure performance, we consider two summary statistics commonly used in the literature:

  1. For each simulated data set, we compute the maximum rank achieved by the true predictors X_j, j ∈ A; or, equivalently, the minimum number of top-ranked covariates that must be retained to ensure A ⊆ Â_d. Following Zhu et al. (2011), we denote this by R and call it the ranking measure. An R close to s = |A| is evidence of ranking consistency, another important property for feature screening procedures (see Zhu et al. (2011)). In the result tables below, we present the median (median absolute deviation in parentheses) of R over N = 1000 simulated datasets corresponding to each simulation scenario.

  2. For each simulation scenario, fixing d = |Â_d| = k(n/log(n)) (k = 1, 2, or 3; see Section 2.5), we compute the proportion of simulated data sets (out of N) in which A ⊆ Â_d. Following Shao and Zhang (2014), we denote this by P_a. A P_a close to 1 is evidence of sure screening for a procedure. To further investigate which among the active predictors are easier or harder to retain for a procedure, we also consider predictor-specific inclusion proportions, denoted by P_j, j ∈ A. We refer to P_a and the P_j's as the inclusion measures.
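Given the utility vectors produced by a screen over N simulated datasets, both summary statistics are direct to compute; a small sketch (the function names are ours):

```python
import numpy as np

def ranking_measure(utilities, active):
    """R: the worst (maximum) rank achieved by any active covariate, i.e.
    the minimum number of top-ranked covariates ensuring full retention."""
    u = np.asarray(utilities, dtype=float)
    ranks = np.empty(u.size, dtype=int)
    ranks[np.argsort(-u)] = np.arange(1, u.size + 1)  # rank 1 = largest utility
    return max(ranks[j] for j in active)

def inclusion_measures(all_utilities, active, d):
    """P_a and the per-predictor P_j over N datasets: the proportions of
    datasets in which the whole active set (resp. predictor j) lands in
    the top-d retained covariates."""
    N = len(all_utilities)
    keep = [set(np.argsort(-np.asarray(u, float))[:d]) for u in all_utilities]
    p_j = {j: sum(j in k for k in keep) / N for j in active}
    p_a = sum(set(active) <= k for k in keep) / N
    return p_a, p_j
```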

3.2. Simulation scenarios

We create simulation scenarios based on the elements described below.

  1. Sample size, number of predictors, and number of active predictors (n, p, s). We use p = 2000, n = 200 and 600, and s = |A| varying between 3 and 40 (this controls the sparsity level: the smaller s, the sparser the problem). As the only exception, for Model(3)(d) we use n = 117 (see below).

  2. Nature of the predictors. We simulate the p entries in the covariate vector X with different schemes. We start by drawing from a p-variate Gaussian X ~ N_p(0, Σ_X) (see below for covariance specifications) and: (i) we keep the vector as drawn, to have p continuous predictors (Models (1), (2), (3)(a), (4), (5)); (ii) we replace 50% of the entries in X with independently drawn binary predictors (Model(3)(b)); (iii) we replace 50% of the entries in X with independently drawn “perturbed” continuous predictors obtained from a Gaussian mixture spiked with a very high variance component (Model(3)(c)). In addition, we consider (iv) predictors randomly selected from the ones in our real data application in Section 4 (Model(3)(d)).

  3. Structure of the predictor covariance matrix. For (i)–(iii) above, the multivariate Gaussian X ~ N_p(0, Σ_X) has four different covariance specifications: (i) Σ_X^{(I)} = I_p, the identity matrix (Independent); (ii) Σ_X^{(A)} = {σ_ij}, σ_ij = 0.8^{|i−j|}, i, j = 1, 2, …, p (Auto-regressive); (iii) Σ_X^{(B)} = {σ_ij}, σ_ii = 1; σ_ij = 0.4 for i ≠ j with i, j both in A or both in I; and σ_ij = 0.1 for i ∈ A, j ∈ I or i ∈ I, j ∈ A, i, j = 1, 2, …, p (Block-structured); and (iv) Σ_X^{(C)} = {σ_ij}, σ_ii = 1; σ_ij = 0.2 for i ≠ j, i, j = 1, 2, …, p (Compound-symmetric).
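The four covariance specifications are simple to construct directly; a sketch (function names are ours, with defaults matching the values above):

```python
import numpy as np

def sigma_I(p):
    """Independent: the identity matrix I_p."""
    return np.eye(p)

def sigma_A(p, rho=0.8):
    """Auto-regressive: sigma_ij = rho^|i-j|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def sigma_B(p, active, w_in=0.4, w_out=0.1):
    """Block-structured: 1 on the diagonal, w_in when i, j are both active
    or both inactive, w_out across the two groups."""
    act = np.zeros(p, dtype=bool)
    act[list(active)] = True
    same_group = act[:, None] == act[None, :]
    S = np.where(same_group, w_in, w_out).astype(float)
    np.fill_diagonal(S, 1.0)
    return S

def sigma_C(p, rho=0.2):
    """Compound-symmetric: 1 on the diagonal, rho everywhere else."""
    return np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
```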

  4. Response generating process. We generate a continuous Y using single- or multi-index models. These comprise m = 1, 2, or 3 indexes (i.e., linear combinations of the predictors Xj, j ∈ A) acting linearly or non-linearly on the mean or the variance of Y | X (Models (1)–(5)).

  5. Nature of the error. We always use additive errors and consider: (i) two homoscedastic cases, namely a Gaussian error ϵ1 ~ N1(0, 1) and a mixture of Gaussian errors ϵ2 ~ 0.5 × N1(0, 1) + 0.5 × N1(0, 10^2); and (ii) a heteroscedastic case, namely a Gaussian error ϵ3 ~ N1(0, g^2(σ, β^T X)).

  6. Signal-to-Noise Ratio (SNR). We define SNR = Var(E(Y | X)) / E(Var(Y | X)). When we use ϵ1 or ϵ2, all signal is contained in the mean E(Y | X); Var(Y | X) does not depend on X. In these cases, SNR has the standard definition. When we use ϵ3, Var(Y | X) = g^2(σ, β^T X) itself contains "signal"; SNR benchmarks the signal in the mean against that in the variance. For the homoscedastic cases, we vary a scalar multiplier σ in the mean functions E(Y | X), and for the heteroscedastic case, the σ in g^2(σ, β^T X), so as to obtain SNRs ranging between 0.05 and 20 (see below).
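As an illustration of point 3 above, the four covariance specifications can be built directly; a minimal sketch (the function names and default arguments are ours):

```python
import numpy as np

def cov_independent(p):
    """Sigma_X(I): identity covariance."""
    return np.eye(p)

def cov_autoregressive(p, rho=0.8):
    """Sigma_X(A): sigma_ij = rho^|i - j|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def cov_block(p, active, r_within=0.4, r_between=0.1):
    """Sigma_X(B): 0.4 within the active/inactive blocks, 0.1 across them."""
    in_A = np.zeros(p, dtype=bool)
    in_A[active] = True
    same_block = in_A[:, None] == in_A[None, :]   # both in A, or both in I
    S = np.where(same_block, r_within, r_between)
    np.fill_diagonal(S, 1.0)
    return S

def cov_compound(p, rho=0.2):
    """Sigma_X(C): compound symmetry, sigma_ij = rho for i != j."""
    S = np.full((p, p), rho)
    np.fill_diagonal(S, 1.0)
    return S
```

Each function returns a p × p matrix ready to be passed to a multivariate Gaussian sampler.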

In more detail, for the response generating process, we consider five models:

Model(1) Variation of Example 1 in Zhu et al. (2011).

Linear, single-index with homoscedastic additive error: Y = σ(β1^T X) + ϵ1 with β1 = (1, 0.8, 0.6, 0.4, 0.2, 0, …, 0)^T. Here m = 1 and A = {1, 2, 3, 4, 5}, so s = 5. We use n = 200; X ~ Np(0, ΣX) with both ΣX(A) and ΣX(B); and σ ranging from 0.34 to 2.02, giving rise to SNRs from ≈ 0.8 ("low") to ≈ 20 ("high"). Given its underlying assumptions, we included HOLP (Wang and Leng, 2016) in our comparisons only for this model, where its performance should be among the best.

Model(2) Variation of Example 3.b in Zhu et al. (2011).

Multi-index with homoscedastic additive error: Y = β1^T X + exp(β2^T X) + ϵ1 with β1 = (2U1, …, 2U_{s/2}, 0, …, 0)^T, β2 = (0, …, 0, 2 + U_{s/2+1}, …, 2 + U_s, 0, …, 0)^T, and the Uk's independently drawn from a uniform distribution on [0, 1]. Here m = 2 and A = {1, …, s}. We use s = 4, 8, 16, 24, 32, and 40; n = 200; and X ~ Np(0, ΣX) with both ΣX(A) and ΣX(B).

Model(3) Variation of Example 3.1 in Chen et al. (2018).

Multi-index with homoscedastic additive error: Y = σ(X1 + 0.75X2^2 + 2.25 cos(X5)) + ϵ1. Here m = 3 and A = {1, 2, 5} with s = 3. For this model, we implement different specifications, namely: (a) Continuous predictors. We use n = 200 and 600; X ~ Np(0, ΣX) with both ΣX(I) and ΣX(C) (the compound-symmetric covariance should hinder screening, since the covariates possess sizable and equal correlations within and between A and I); and σ = 0.50, 1.25, 2.50, respectively giving rise to SNR ≈ 0.8 ("low"), ≈ 5 ("moderate"), and ≈ 20 ("high"). (b) Mix of continuous and binary predictors. We use n = 200; X ~ Np(0, ΣX) with both ΣX(I) and ΣX(C), in which 50% of the entries (X1 and a random selection from X6–X2000) are replaced with 0/1 entries drawn independently with success probabilities equal to the sample qth quantiles of the Xj being replaced, using q from 0.30 to 0.70; and σ = 1.5 and 2.1, respectively giving rise to SNR ≈ 5 ("moderate") and ≈ 10 ("high"). (c) Mix of continuous and "perturbed" predictors. We use n = 200 and 600; X ~ Np(0, ΣX) with both ΣX(I) and ΣX(C), in which 50% of the entries (X1 and a random selection from X6–X2000) are replaced with entries independently drawn from the univariate Gaussian mixture 0.95 × N1(0, 1) + 0.05 × N1(0, 10^2); and σ = 0.805 and 1.14, respectively giving rise to SNR ≈ 5 ("moderate") and ≈ 10 ("high"). (d) "Realistic" predictors. For each simulation repetition, we randomly sub-sample p = 2000 predictors from the set of 18941 gene/probe ID expressions in our transcriptomic application (see Section 4). We use the sample size n = 117 of that application, and σ = 1.25 and 1.77, respectively giving rise to SNR ≈ 5 ("moderate") and ≈ 10 ("high"). Before generating the response Y, we standardize all predictors marginally to have mean 0 and variance 1. Since the sample size is smaller than in the other simulation scenarios, we only use CIS with a small number of slices (L = 2, 3, and 5). Finally, to account for the complexity of this "realistic" predictor data, here we use the hard thresholds corresponding to n = 600, i.e., d = 93 (n/log(n)), 187 (2n/log(n)), and 281 (3n/log(n)).
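As an illustration, Model(3)(a) data under the compound-symmetric covariance ΣX(C) can be simulated as below; the sketch exploits the one-factor representation X_j = √ρ W + √(1−ρ) Z_j of a compound-symmetric Gaussian vector (the seed is arbitrary, and σ = 1.25 is the paper's "moderate"-SNR value):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma, rho = 200, 2000, 1.25, 0.2

# Compound-symmetric Gaussian predictors via a shared factor W:
# X_j = sqrt(rho) * W + sqrt(1 - rho) * Z_j has Var = 1, Cov(X_i, X_j) = rho
W = rng.standard_normal((n, 1))
Z = rng.standard_normal((n, p))
X = np.sqrt(rho) * W + np.sqrt(1 - rho) * Z

# Model(3)(a): actives X1, X2, X5 (0-based indices 0, 1, 4), Gaussian error
Y = sigma * (X[:, 0] + 0.75 * X[:, 1] ** 2 + 2.25 * np.cos(X[:, 4])) \
    + rng.standard_normal(n)
```

The factor representation avoids the O(p^3) Cholesky factorization of the full 2000 × 2000 covariance matrix.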

Model(4)

Same as Model(3)(a), but with a mixture of Gaussian errors ϵ2: Y = σ(X1 + 0.75X2^2 + 2.25 cos(X5)) + ϵ2. The mixture induces heavier tails, and thus increased variance, for the error (and response) compared to previous models.

Model(5).

Multi-index with heteroscedastic error: Y = X1 + X2^2 + ϵ3 with g(σ, β^T X) = exp{σ|X22|}. Here m = 3 and A = {1, 2, 22} with s = 3. We use n = 200 and 600; X ~ Np(0, ΣX) with ΣX = ΣX(A); and σ in the range 0.23–1.37, giving rise to SNRs in the range 0.05–2. Recall that, instead of the standard definition, SNR here benchmarks the strength of the mean signal against that of the variance signal.
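Model(5) data can be sketched similarly; here the stationary AR(1) recursion reproduces ΣX(A) (σ_ij = 0.8^|i−j|) exactly, and the σ value shown is one illustrative point in the paper's 0.23–1.37 range (seed and value are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma, rho = 200, 2000, 0.5, 0.8

# AR(1) Gaussian predictors: X_j = rho * X_{j-1} + sqrt(1 - rho^2) * Z_j
# gives the auto-regressive covariance sigma_ij = rho^|i - j| exactly
Z = rng.standard_normal((n, p))
X = np.empty((n, p))
X[:, 0] = Z[:, 0]
for j in range(1, p):
    X[:, j] = rho * X[:, j - 1] + np.sqrt(1 - rho ** 2) * Z[:, j]

# Model(5): mean signal from X1 and X2^2; variance signal through X22 (index 21)
g = np.exp(sigma * np.abs(X[:, 21]))
Y = X[:, 0] + X[:, 1] ** 2 + g * rng.standard_normal(n)
```

Because the error standard deviation g depends on X22, that predictor is active even though it never enters the conditional mean.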

In addition to the above scenarios, all with a continuous response, we also investigate scenarios with a categorical response (see Model(6) in Section S3). For each of the scenarios described above, we simulate N = 1000 datasets to compute the performance summary statistics, and we assess the effect of the number of slices on CIS screening for L varying between 2 and 12.

3.3. Simulation results

Due to space constraints, here we present results only for selected scenarios of Models(2) and (3)(a), and selected number of slices (L) used in CIS. Full results for all models with all scenarios and all choices of L are reported in Section S3.

Table 2 contains ranking measures (R) for Model(2), summarizing performance under different predictor covariance structures and sparsity levels. Under the block covariance structure ΣX(B), CIS and SIRS outperform all other procedures for all values of s considered (s = 4 to 40 active predictors out of p = 2000). Under the auto-regressive covariance structure ΣX(A), CIS and SIRS again outperform other procedures. However, as sparsity decreases (s > 16), CIS deteriorates faster than SIRS. A potential explanation is that under ΣX(A), as the number of active predictors s increases, more inactive predictors highly correlated with their adjacent active ones confound the CIS ranking. On the contrary, under ΣX(B), the level of correlation between active and inactive predictors is fixed at a relatively low 0.1. Notably, SIS, DC-SIS and MDC-SIS perform very poorly – except under ΣX(A) and very marked sparsity (s = 4).
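For reference, the ranking measure R reported in these tables can be read as the minimum model size needed to retain all active predictors (so the ideal value is s); under that reading, a sketch with a toy utility vector of our own:

```python
import numpy as np

def ranking_measure(utility, active):
    """Minimum number of top-ranked predictors needed to cover the active set A
    (smaller is better; the ideal value is s = |A|)."""
    order = np.argsort(-utility)                     # highest utility first
    positions = {j: r + 1 for r, j in enumerate(order)}
    return max(positions[j] for j in active)

u = np.array([0.9, 0.1, 0.8, 0.7, 0.05])             # toy marginal utilities
R = ranking_measure(u, active=[0, 2])                # actives ranked 1st and 2nd
```

With actives ranked 1st and 2nd, R attains its ideal value s = 2; if an active predictor were pushed down the ranking, R would grow accordingly.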

Table 2.

Median (median absolute deviation) of the ranking measure R over N = 1000 simulated datasets. Model(2) with n = 200 and p = 2000. Sparsity (s) and the predictor covariance structure (ΣX) vary as indicated in the table. CIS results shown for L = 5 slices. Performance is better for values closer to s.

SIS SIRS DC-SIS MDC-SIS CIS[5] SIS SIRS DC-SIS MDC-SIS CIS[5]
s = 4 s = 24
ΣX(A) 30 (25) 4 (0) 9 (5) 24 (20) 4 (0) 1852 (107) 29 (4) 1846 (111) 1860 (101) 84 (56)
ΣX(B) 248 (231) 4 (0) 59 (55) 254 (242) 4 (0) 1427 (359) 24 (0) 1351 (366) 1513 (355) 24 (0)
s = 8 s = 32
ΣX(A) 679 (502) 9 (1) 610 (454) 688 (519) 8 (0) 1918 (58) 52 (17) 1915 (62) 1917 (60) 382 (248)
ΣX(B) 692 (462) 8 (0) 537 (403) 767 (502) 8 (0) 1600 (286) 32 (0) 1554 (287) 1644 (268) 33 (1)
s =16 s = 40
ΣX(A) 1694 (244) 18 (1) 1644 (261) 1692 (241) 19 (3) 1935 (45) 93 (43) 1931 (50) 1937 (46) 807 (403)
ΣX(B) 1204 (456) 16 (0) 1130 (434) 1270 (450) 16 (0) 1695 (223) 40 (0) 1659 (241) 1723 (215) 41 (1)

ΣX(A) = Auto-regressive; ΣX(B) = Block structure.

Table 3 contains inclusion measures (Pa and Pj, j ∈ A = {1, 2, 3, 4}) for Model(2), again under both ΣX(A) and ΣX(B) – but focusing on the s = 4 case. The excellent and comparable performance of CIS and SIRS is evident from these inclusion measures. Interestingly, the poorer performance of SIS, DC-SIS, and MDC-SIS is driven by their inability to capture the active covariates X1 and X2 involved in the first index, that is, the linear component in E(Y | X) (see Model(2)). While predictors acting linearly ought to be easy to identify, even with the Pearson correlation-based SIS, the exponential scale likely renders the signals associated with X3 and X4 much stronger.

Table 3.

Inclusion measures Pa and Pj, j = 1, 2, 3, 4, over N = 1000 simulated datasets, provided at thresholds d = n/log(n), 2n/log(n), 3n/log(n) (triplets in parentheses). Model(2) with n = 200, p = 2000, and s = 4. The predictor covariance structure (ΣX) varies as indicated in the table. CIS results shown for L = 5 slices. Values are multiplied by 10^3; values closer to 1K = 1000 indicate better performance.

s = 4 SIS SIRS DC-SIS MDC-SIS CIS[5]
ΣX(A) ΣX(B) ΣX(A) ΣX(B) ΣX(A) ΣX(B) ΣX(A) ΣX(B) ΣX(A) ΣX(B)
P1 (572, 697, 730) (411, 494, 538) (1K, 1K, 1K) (1K, 1K, 1K) (750, 818, 865) (593, 679, 733) (594, 690, 734) (426, 498, 538) (1K, 1K, 1K) (999, 999, 1K)
P2 (813, 888, 919) (406, 488, 548) (1K, 1K, 1K) (1K, 1K, 1K) (933, 964, 974) (603, 678, 721) (830, 895, 925) (415, 498, 546) (1K, 1K, 1K) (997, 998, 998)
P3 (997, 1K, 1K) (939, 967, 978) (1K, 1K, 1K) (1K, 1K, 1K) (999, 1K, 1K) (993, 995, 996) (998, 1K, 1K) (951, 972, 983) (1K, 1K, 1K) (1K, 1K, 1K)
P4 (996, 1K, 1K) (927, 958, 969) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (984, 990, 993) (997, 1K, 1K) (943, 965, 977) (1K, 1K, 1K) (1K, 1K, 1K)
Pa (547, 658, 715) (238, 315, 367) (1K, 1K, 1K) (1K, 1K, 1K) (743, 812, 858) (434, 538, 588) (569, 669, 720) (242, 315, 363) (1K, 1K, 1K) (996, 997, 998)

ΣX(A) = Auto-regressive; ΣX(B) = Block structure.

Table 4 contains ranking measures for Model(3)(a), summarizing performance under different predictor covariance structures, SNR levels, and sample sizes. Under both ΣX(I) and ΣX(C), the ranking performances of CIS, DC-SIS, and MDC-SIS beat those of SIS and SIRS for all SNRs and sample sizes. In particular, at moderate and high SNRs (5 and 20 respectively), CIS, DC-SIS, and MDC-SIS successfully rank the three active predictors as the top three. For higher sample size (n = 600), this ranking performance improves even at low SNR (0.8). Notably, although SIRS is a model-free procedure, it fails for Model(3)(a) in all scenarios – most likely due to the presence of m = 3 active indexes violating condition (C1) in Zhu et al. (2011).

Table 4.

Median (median absolute deviation) of the ranking measure R over N = 1000 simulated datasets. Model(3)(a) with p = 2000 and s = 3. Sample size (n), SNR, and predictor covariance structure (ΣX) vary as indicated in the table. CIS results shown for L = 5 slices. Performance is better for values closer to s = 3.

SIS SIRS DC-SIS MDC-SIS CIS[5] SIS SIRS DC-SIS MDC-SIS CIS[5]
n = 200, SNR = 0.8 n = 600, SNR = 0.8
ΣX(I) 1279 (406) 1425 (365) 28 (19) 22 (14) 148 (127) 1333 (411) 1470 (358) 3 (0) 3 (0) 3 (0)
ΣX(C) 1398 (379) 1451 (350) 80 (61) 65 (51) 190 (158) 1477 (352) 1474 (338) 4 (1) 4 (1) 3 (0)
n = 200, SNR = 5 n = 600, SNR = 5
ΣX(I) 1230 (460) 1423 (375) 3 (0) 3 (0) 4 (1) 1183 (473) 1513 (318) 3 (0) 3 (0) 3 (0)
ΣX(C) 1394 (382) 1485 (337) 10 (7) 11 (7) 5 (2) 1493 (372) 1504 (321) 3 (0) 3 (0) 3 (0)
n = 200, SNR = 20 n = 600, SNR = 20
ΣX(I) 1231 (465) 1414 (398) 3 (0) 3 (0) 3 (0) 1246 (439) 1504 (335) 3 (0) 3 (0) 3 (0)
ΣX(C) 1449 (379) 1562 (294) 6 (3) 7 (4) 3 (0) 1561 (346) 1522 (321) 3 (0) 3 (0) 3 (0)

ΣX(I) = Independent, ΣX(C) = Compound-symmetric.

Tables 5 (n = 200) and 6 (n = 600), which contain the inclusion measures for Model(3)(a), support the sure screening property of CIS under both ΣX(I) and ΣX(C), and at low, moderate, and high SNRs. The excellent and comparable performance of CIS (except at low SNR and sample size), DC-SIS, and MDC-SIS is once again evident. The Pearson correlation-based SIS and, interestingly, also the model-free SIRS fail to capture X2 and X5 – which are non-linearly associated with Y. Notably, when n = 200, CIS performs better with L = 3 (CIS[3]) than with L = 5 (CIS[5]), likely because more observations are available for in-slice calculations. In fact, for n = 200, SNR = 5 and 20, and ΣX(C) (the compound-symmetric structure that ought to hinder screening), the Pa's for CIS[3] are the highest.

Table 5.

Inclusion measures Pa and Pj, j = 1, 2, 5, over N = 1000 simulated datasets, provided at thresholds d = n/log(n), 2n/log(n), 3n/log(n) (triplets in parentheses). Model(3)(a) with p = 2000, s = 3, and n = 200. SNR and predictor covariance structure (ΣX) vary as indicated in the table. CIS results shown for L = 3 and 5 slices. Values are multiplied by 10^3; values closer to 1K = 1000 indicate better performance.

n=200 SNR ≈ 0.8 SNR ≈ 5 SNR ≈ 20
ΣX(I) ΣX(C) ΣX(I) ΣX(C) ΣX(I) ΣX(C)
SIS P1 (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K)
P2 (60, 93, 118) (49, 76, 104) (101, 142, 185) (62, 97, 132) (111, 160, 207) (73, 108, 144)
P5 (46, 76, 109) (22, 46, 72) (60, 100, 132) (44, 66, 83) (72, 106, 138) (49, 84, 107)
Pa (3, 9, 11) (0, 1, 3) (6, 18, 29) (2, 5, 12) (3, 13, 25) (2, 6, 10)
SIRS P1 (999, 1K, 1K) (999, 999, 999) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K)
P2 (27, 52, 89) (23, 44, 64) (36, 58, 86) (18, 33, 50) (52, 91, 121) (27, 45, 63)
P5 (36, 71, 94) (19, 39, 62) (39, 62, 89) (25, 46, 57) (41, 75, 93) (18, 39, 56)
Pa (0, 1, 7) (1, 2, 3) (2, 4, 7) (0, 2, 3) (0, 6, 11) (0, 2, 2)
DC-SIS P1 (999, 1K, 1K) (999, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K)
P2 (748, 875, 915) (489, 642, 734) (997, 1K, 1K) (840, 911, 948) (999, 999, 1K) (902, 949, 966)
P5 (818, 923, 954) (582, 735, 812) (998, 1K, 1K) (932, 969, 980) (1K, 1K, 1K) (974, 992, 998)
Pa (601, 804, 871) (314, 484, 610) (995, 1K, 1K) (788, 886, 929) (999, 999, 1K) (879, 941, 964)
MDC-SIS P1 (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K)
P2 (810, 907, 946) (530, 688, 766) (998, 1K, 1K) (862, 933, 955) (999, 1K, 1K) (914, 960, 977)
P5 (841, 940, 967) (582, 750, 816) (1K, 1K, 1K) (901, 952, 973) (1K, 1K, 1K) (949, 980, 986)
Pa (671, 849, 914) (331, 536, 639) (998, 1K, 1K) (780, 887, 929) (999, 1K, 1K) (865, 940, 963)
CIS[3] P1 (639, 731, 779) (591, 683, 740) (996, 998, 998) (990, 993, 994) (999, 999, 999) (1K, 1K, 1K)
P2 (680, 749, 788) (610, 701, 745) (942, 964, 973) (929, 950, 961) (960, 980, 985) (944, 962, 973)
P5 (787, 843, 872) (735, 799, 845) (984, 989, 993) (969, 979, 984) (997, 999, 1K) (993, 998, 999)
Pa (335, 455, 532) (245, 369, 454) (923, 952, 965) (891, 923, 939) (956, 978, 984) (937, 960, 972)
CIS[5] P1 (572, 674, 736) (538, 635, 701) (989, 994, 997) (984, 992, 996) (1K, 1K, 1K) (1K, 1K, 1K)
P2 (628, 726, 766) (551, 639, 697) (910, 943, 956) (899, 941, 957) (947, 965, 973) (927, 948, 956)
P5 (723, 792, 825) (673, 761, 802) (967, 977, 980) (958, 972, 980) (993, 995, 998) (987, 991, 996)
Pa (235, 372, 443) (184, 298, 380) (868, 915, 933) (843, 906, 933) (940, 960, 971) (914, 939, 952)

ΣX(I)= Independent, ΣX(C)= Compound-symmetric.

The results described above represent only a small portion of the extensive simulation experiments we conducted (see Section 3.2). Below, we describe salient trends and observations based also on the additional results presented in Section S3. CIS exhibits promising performance in terms of sure screening as well as rank consistency under a wide range of scenarios. Overall, as expected intuitively, CIS performs better at larger SNRs and sample sizes, where it is less sensitive to the number of slices (L) used for a continuous Y. In Model(1) scenarios, CIS shows sure screening and rank consistency performance competitive with that of the other screening procedures (Tables S1 and S2), especially for larger SNRs (5 and 20) and smaller L (3 and 5). In Model(2) scenarios, CIS does better than SIS, DC-SIS, and MDC-SIS; it performs very well when sparsity is high and, while it tends to deteriorate at lower sparsity under one predictor covariance structure, it remains stable across sparsity levels under the other (Tables 2, 3, S3 and S4). In all the Model(3) scenarios ((a)-(d)), CIS does better than SIS and SIRS – which fail for reasons similar to those articulated above for Model(3)(a). In general, CIS has good performance at "moderate" and "high" SNRs, and its deterioration at lower SNRs can be counteracted by increasing the sample size and/or using a smaller number of slices to guarantee a sufficient number of observations per slice (Tables 4 - 6 and S5 - S15). Moreover, the measures for inclusion (Pa) and ranking (R) provide empirical evidence for sure screening and ranking consistency of CIS, respectively. This is true when all predictors are continuous (Model(3)(a); Tables 4 - 6, S5 - S7), but also in cases where continuous predictors are mixed with categorical predictors (Model(3)(b); Tables S8 - S9), or with "perturbed" continuous predictors (Model(3)(c); Tables S10 - S13), and when predictors are sub-sampled from real data (Model(3)(d); Tables S14 - S15).
Notably, in the latter "realistic" scenario, which carries substantial collinearities (see Section 4 and Figure S1), CIS with L = 3 slices (CIS[3]) performs the best, beating also the otherwise strongest competitor DC-SIS. CIS performs very well also with the heavier-tailed error of Model(4) (Tables S16 - S18; results are similar to those for Model(3)(a)). When n = 200, almost all Pa's for CIS[3] beat those for DC-SIS and MDC-SIS under ΣX(C) and, with a larger sample size (n = 600), CIS performs competitively with these methods in all respects. SIS and SIRS fail in all Model(4) scenarios for reasons similar to those discussed for Model(3)(a). In Model(5) scenarios, where the error is heteroscedastic, CIS and DC-SIS are the best overall performers (Tables S19 - S21). When n = 200, the ranking performance of CIS with L = 5 slices (CIS[5]) is better than with smaller or larger L, likely due to the added complexity of capturing signals in the variance (as opposed to the mean). When n = 600, once again CIS performs well, along with DC-SIS, across all L = 3 – 12 and SNR levels less than 1. Notably, MDC-SIS always fails to capture the active predictor X22 present in the variance component of the model (Table S19). This is because its marginal utility is designed to detect predictors contributing to the conditional mean of the response (Shao and Zhang, 2014). Finally, results for Model(6) scenarios demonstrate that CIS performs quite well also in problems with categorical responses (Tables S22 - S24).

Table 6.

Same as Table 5 for Model(3)(a), with sample size n = 600.

n=600 SNR ≈ 0.8 SNR ≈ 5 SNR ≈ 20
ΣX(I) ΣX(C) ΣX(I) ΣX(C) ΣX(I) ΣX(C)
SIS P1 (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K)
P2 (103, 165, 231) (86, 142, 193) (168, 249, 307) (107, 170, 213) (170, 234, 290) (125, 187, 238)
P5 (66, 121, 184) (72, 132, 187) (122, 187, 241) (89, 142, 188) (115, 174, 237) (78, 132, 182)
Pa (2, 11, 41) (7, 16, 30) (22, 53, 81) (7, 19, 36) (18, 44, 68) (10, 22, 40)
SIRS P1 (885, 948, 975) (875, 942, 967) (997, 999, 1K) (998, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K)
P2 (22, 52, 86) (22, 58, 95) (15, 44, 74) (21, 45, 83) (6, 28, 57) (16, 48, 78)
P5 (77, 138, 195) (82, 130, 185) (123, 181, 230) (90, 157, 209) (106, 177, 226) (85, 136, 185)
Pa (0, 4, 14) (2, 5, 15) (1, 7, 19) (2, 4, 13) (2, 7, 14) (0, 3, 9)
DC-SIS P1 (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K)
P2 (1K, 1K, 1K) (990, 997, 999) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K)
P5 (1K, 1K, 1K) (995, 999, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K)
Pa (1K, 1K, 1K) (985, 996, 999) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K)
MDC-SIS P1 (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K)
P2 (1K, 1K, 1K) (991, 998, 998) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K)
P5 (1K, 1K, 1K) (994, 999, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K)
Pa (1K, 1K, 1K) (985, 997, 998) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K)
CIS[3] P1 (994, 999, 999) (991, 996, 996) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K)
P2 (987, 992, 995) (979, 993, 997) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K)
P5 (1K, 1K, 1K) (998, 999, 999) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K)
Pa (981, 991, 994) (968, 988, 992) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K)
CIS[5] P1 (989, 992, 993) (986, 993, 996) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K)
P2 (984, 992, 995) (976, 982, 988) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K)
P5 (998, 998, 999) (997, 998, 999) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K)
Pa (971, 982, 987) (959, 973, 983) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K) (1K, 1K, 1K)

ΣX(I)= Independent, ΣX(C) = Compound-symmetric.

4. Application to transcriptomic data

In this section, we analyze a transcriptomic dataset (Affymetrix GeneChip Rat Genome 230 2.0 Array data; Scheetz et al., 2006) already used in the feature screening and variable selection literature (Fan et al., 2011b; Wang et al., 2012; Huang et al., 2010; Shao and Zhang, 2014; Wang and Leng, 2016). The expression Quantitative Trait Loci (eQTL) experiment in Scheetz et al. (2006) used gene transcription measurements from 120 12-week-old male F2 Norway rats (Rattus norvegicus) to better understand gene regulation in the mammalian eye, with potential relevance to the study of human eye disease. The dataset is publicly available at the NCBI Gene Expression Omnibus with accession number GSE5680.

Following Shao and Zhang (2014), we pre-process the data taking log transformations and eliminating all genes (more accurately, probe IDs) that do not show sufficient variation across rats – which leaves us with 18976 genes. As in prior screening exercises conducted on this dataset, we consider as response the transcription of TRIM32 (probe ID: 1389163_at), a gene with a causal association with Bardet-Biedl syndrome which affects multiple human systems (Chiang et al., 2006). As a further preprocessing step, we identify outliers detected by both the built-in R statistical software function boxplot() and the “thrice median absolute deviation rule” (Barghash et al., 2016) (see for instance Figure S2). We eliminate 34 genes that contain more than 12 outliers (10% of the total number of rats). Also, we omit three outliers detected for the response. Thus, we eventually work with a dataset comprising transcription levels for p = 18941 genes (the predictors) and transcription levels for TRIM32 (the response) measured on n = 117 rats (the observations). On this dataset, we apply our CIS with L = 3 slices (CIS[3]), as well as SIS, HOLP, SIRS, DC-SIS, and MDC-SIS screening procedures, and GAMSEL (Chouldechova and Hastie, 2015) – a generalized additive model selection procedure.
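The "thrice median absolute deviation rule" used above flags values lying more than three raw MADs from the median; a minimal sketch of that filter (whether a consistency constant, e.g. 1.4826, is applied to the MAD is an implementation choice we leave out here):

```python
import numpy as np

def thrice_mad_outliers(x):
    """Flag entries farther than 3 MADs from the median (Hampel-style filter)."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))   # raw MAD, no consistency constant
    return np.abs(x - med) > 3 * mad

# toy expression values for one gene across rats; 25.0 is the obvious outlier
x = np.array([10.1, 9.9, 10.0, 10.2, 9.8, 25.0])
flags = thrice_mad_outliers(x)
```

Applied gene by gene, a filter like this yields the per-gene outlier counts used to discard genes with more than 12 flagged rats.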

First, for each gene, we exclude outliers (12 or fewer values) and compute marginal utilities for all screening procedures considered. Next, we focus on the top ranked d = 10 genes, which differ substantially across procedures (Table S26) – likely due to linear associations among predictors (the absolute values of pair-wise Pearson correlations range between 0 and 0.9812; first quartile 0.0847, median 0.1796, third quartile 0.3053) and/or other complexities of the problem (e.g., level of sparsity, strength and nature of the signals). Notably though, two of the top 10 CIS[3] genes (ranks 1 and 6) are also within the top genes reported in Fan et al. (2011b). Also, the genes ranked 6 and 9 by CIS[3] are placed in the top 10 by all other procedures (except HOLP). DC-SIS and SIRS also include the gene ranked 5 by CIS[3] in their top 10. Interestingly, none of the top 10 genes for HOLP overlap with the top 10 of any other procedure considered.

Figure 1 illustrates the marginal associations between the transcription of TRIM32 (response) and those of the top 10 CIS[3] genes (predictors). The panels for the genes ranked 5, 8, and 10 clearly show non-linear relationships, supporting the notion that our CIN, unlike the marginal utility used by SIS, is a general measure of association. Preliminary queries indicate that the top 10 CIS[3] genes do indeed have biological significance. Most of them are conserved in other mammalian and vertebrate species including human, chimpanzee, Rhesus monkey, dog, cow, mouse, chicken, zebrafish and frog – suggesting that they fulfill critical functions in the genome. Klhl7 (the gene ranked 7) is involved in the eye disease Retinitis pigmentosa (RP) in Norway rats; see the Rat Genome Database (RGD), ID 1305564. According to the Online Mendelian Inheritance in Man (OMIM) database, its human ortholog KLHL7 also plays a role in human RP (OMIM ID 611119; Friedman et al. (2009); Wen et al. (2011)). In addition, according to the Mouse Genome Informatics (MGI) database, during wild-type mice development Klhl7 is expressed in the retina (MGI ID 1196453), Rbfox1 (the gene ranked 8) in the retina ganglion cell layer (MGI ID 1926224), and Dr1 (the gene ranked 5) in the retinal inner and outer layers (MGI ID 1100515). Mutations in Fam49b (the gene ranked 2) are involved in abnormal retinal morphology in mice (MGI ID 1923520). Finally, not directly related to the eye, Gorab (the gene ranked 4) plays a role in gerodermia osteodysplastica, osteoporosis and skin abnormalities in Norway rats (RGD ID: 1564990).

Fig. 1.

Fig. 1

Scatterplots of transcription levels of TRIM32 against each of the top d = 10 genes identified by CIS[3]. Solid lines are LOESS smooths; dashed lines are 2-SD prediction bands. Genes marked as ** are also among the top d = 10 of other screening procedures (Rank 5, DC-SIS and SIRS; Ranks 6 and 9, all but HOLP).

When we increase d from 10 to 5000, the overlaps across the top genes identified by various screening procedures increase (e.g., the top 5000 CIS[3] and DC-SIS genes have ≈ 72% overlap; see Table S26). Following Wang and Leng (2016), we consider the top 5000 genes produced by each screening procedure for subsequent modeling. We marginally standardize each set of 5000 top-ranked predictors, as well as the response, to zero mean and unit variance and employ GAMSEL (Chouldechova and Hastie, 2015), a penalized likelihood approach for fitting sparse generalized additive models in high dimension, using the CRAN package gamsel (Chouldechova et al., 2018). We also apply GAMSEL directly on all (standardized) p = 18941 predictors without any screening. We tune the overall penalty parameter (λ ≥ 0) by 10-fold cross-validation, fixing the folds across runs for reproducibility and selecting the largest λ with cross-validation error within 1 standard error of the minimum. We set the penalty mixing parameter (0 ≤ γ ≤ 1; values < 0.5 penalize the linear fit less than the non-linear fit) to γ = 0.6 and the degrees (the maximum number of spline basis functions to use) to 5 for each predictor. All other parameters are left at their default values.

Table 7 shows that CIS[3] leads to the highest deviance explained (81.64%), followed by SIS (81.11%). Notably, GAMSEL applied to all p = 18941 predictors leads to the lowest deviance explained. To provide a benchmark, we create a "null" distribution as follows: we select d = 5000 genes at random 1000 times, each time fitting GAMSEL and recording the corresponding deviance explained. The density plot in Figure 2 shows that the deviance explained with 5000 CIS[3]-screened genes is significantly larger than expected when randomly selecting 5000 genes. In contrast, the deviance explained with 5000 HOLP-screened genes is not; nor is the deviance explained with GAMSEL applied to all genes.
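The empirical p-values reported in Figure 2 can be obtained by locating each observed deviance explained within the null sample; a sketch with an add-one Monte Carlo correction (the Gaussian stand-in for the 1000 random-gene GAMSEL fits is ours):

```python
import numpy as np

def empirical_pvalue(observed, null_sample):
    """Share of null fits whose deviance explained meets or exceeds the observed
    value; the +1 terms give the usual add-one Monte Carlo correction."""
    null_sample = np.asarray(null_sample)
    return (1 + np.sum(null_sample >= observed)) / (1 + null_sample.size)

rng = np.random.default_rng(2)
null = rng.normal(55, 5, size=1000)    # stand-in for 1000 random-gene GAMSEL fits
p_cis = empirical_pvalue(81.64, null)  # CIS[3] deviance explained from Table 7
```

A value far in the right tail of the null density, as for CIS[3], yields a p-value near its minimum attainable value 1/1001.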

Table 7.

Deviance explained (%) and number of non-zero coefficient estimates selected by 10-fold cross-validated λ (“lambda.1se”) from GAMSEL fits on the top d = 5000 genes/probe IDs ranked by different screens or directly applied to all p = 18941 (“GAMSEL”).

SIS HOLP SIRS DC-SIS MDC-SIS CIS[3] GAMSEL
Deviance explained (%) 81.11 56.04 70.88 67.83 64.32 81.64 47.76
# of non-zero coefficients 48 18 32 30 28 45 14

Fig. 2.

Fig. 2

Density plot of deviance explained (%) from 1000 GAMSEL fits, each using d = 5000 randomly selected genes/probe IDs. Symbols mark performance achieved with those identified by CIS[3], SIS, SIRS, DC-SIS, MDC-SIS, and HOLP screens or directly applied to all p = 18941 (“GAMSEL”); empirical p-values in parentheses in the legends.

Finally, we evaluate the out-of-sample performance of GAMSEL fits on the d = 5000 top-ranked genes produced by each screening procedure, as well as on all p = 18941 genes, and on a random selection of 5000 genes for benchmarking. We produce 200 90%–10% training-validation random splits of the n = 117 observations. We run GAMSEL fits on the training sets, and compute prediction errors on the corresponding validation sets. Figures S3 (a)-(c) display box-plots of, respectively, the training-set deviance explained (in %), the number of non-zero training-set coefficient estimates obtained with 10-fold cross-validated λ's, and the validation-set root mean squared prediction error (RMSPE). On the training sets, HOLP and GAMSEL applied to all genes have similar median deviance explained (≈ 50%) and number of non-zero coefficients (≈ 15), which are comparable to those achieved with a random selection of 5000 genes. All other screening procedures lead to better fits (median deviance explained in the range ≈ 65% – 70%) and larger gene sets (median number of non-zero coefficient estimates ≈ 25). Perhaps not surprisingly, given the small sizes of both training and validation sets (106 and 11, respectively) compared to the dimensionality (d = 5000 for GAMSEL following screens, and p = 18941 for GAMSEL applied directly to all genes), for all procedures the RMSPEs computed on the validation sets are on par with those obtained with a random selection of 5000 genes.
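The repeated-split evaluation can be sketched as follows; note that plain ridge regression stands in here for the GAMSEL fits actually used in the paper, so only the splitting and RMSPE logic mirrors the analysis (function name, penalty, and toy data are ours):

```python
import numpy as np

def split_rmspe(X, Y, n_splits=200, test_frac=0.10, lam=1.0, seed=0):
    """Repeated random train/validation splits; ridge regression stands in
    for the GAMSEL fit used in the paper. Returns the validation RMSPEs."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_test = max(1, int(test_frac * n))     # n = 117 gives 11, as in the paper
    out = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        te, tr = perm[:n_test], perm[n_test:]
        Xtr, Ytr = X[tr], Y[tr]
        beta = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(p), Xtr.T @ Ytr)
        pred = X[te] @ beta
        out.append(np.sqrt(np.mean((Y[te] - pred) ** 2)))
    return np.array(out)

# toy data mimicking the application's sample size (n = 117)
rng = np.random.default_rng(3)
X = rng.standard_normal((117, 20))
Y = X[:, 0] + 0.1 * rng.standard_normal(117)
rmspe = split_rmspe(X, Y)
```

Box-plots of the 200 returned RMSPEs, one set per screening procedure, reproduce the layout of Figure S3(c).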

5. Concluding remarks

In this article, we proposed Covariate Information Number - Sure Independence Screening (CIS) – a model-free feature screening procedure to reduce the predictor dimension in ultrahigh-dimensional supervised problems prior to the use of other statistical techniques for feature selection, dimension reduction, and regression or classification modeling.

CIS is built upon the Covariate Information Number (CIN), a novel marginal utility which is essentially the univariate version of the Covariate Information Matrix (Yao et al., 2019) and has an appealing interpretation in terms of the traditional Fisher Information in Statistics. It is applicable to any type of response (features) – continuous, discrete, or categorical – with continuous features (response), has a reasonable computational burden, and possesses the important sure screening property.

Our simulation results demonstrate that CIS is competitive with, and in some cases superior to, popular feature screening procedures such as SIS, HOLP, SIRS, DC-SIS, and MDC-SIS. CIS successfully identifies active covariates at all levels of sparsity, with both continuous and categorical responses as well as categorical, “perturbed”, and “realistic” predictors. As to be expected, it outperforms SIS in the presence of non-linear signals but, notably, it also outperforms DC-SIS and MDC-SIS in less sparse settings (higher number of active covariates). Moreover, it outperforms SIRS when active covariates affect the response through more than two linear combinations (i.e. indexes in multi-index models). Importantly, in addition to sure screening, our simulation results provide empirical evidence that CIS possesses the ranking consistency property.

Like most procedures, the performance of CIS improves with higher sample sizes and signal-to-noise ratios. The former is particularly relevant because CIS calculations require a reasonable number of observations per slice. Our general suggestion is to employ a relatively small number of slices (say, L = 3–8). Notably though, L ceases to affect CIS performance when the sample size is sufficiently large. Switching to the case of discrete or categorical responses, where L represents the number of distinct Y values, we note that this (similar to the number of active predictors and that of predictors overall) can increase with n. We considered a finite L to theoretically establish the sure screening property for CIS, but the proof could potentially be generalized to a diverging L.

While the sure screening property addresses false negative concerns, screens can retain false positives in cases where correlations between inactive and active covariates produce spurious associations with the response (Fan and Lv, 2008). One way to mitigate this issue is to use iteration. For example, an iterative model-based screening procedure can be found in Fan et al. (2009). Iterative model-free screening procedures also exist, and are often based on the notion of a predictor residual matrix; this was first introduced for iterative SIRS in Zhu et al. (2011) and later used for iterative DC-SIS (Zhong and Zhu, 2015). An iterative CIS could be developed as well. Of course, iteration increases computational burden. In practice, an evaluation of the strength and structure of the associations among covariates can help gauge whether such burden is justified as a way to reduce potential false positives. Further discussion of iteration can be found in Fan et al. (2009). The interested reader can also refer to Univariate Penalization Screening (UPS; Ji and Jin, 2012), Covariance Assisted Screening and Estimation (CASE; Ke et al., 2014), and graphlet screening (Jin et al., 2014), among others, for ideas on two-stage “screen and clean” procedures to tackle potential false positives.
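The residualization idea underlying such iterative screens can be sketched generically as follows. Here `utility` stands for any marginal screening statistic (e.g., an absolute correlation), and the function, its arguments, and its defaults are illustrative assumptions, not the procedure of any specific paper:

```python
import numpy as np

def iterative_screen(X, y, utility, step=5, n_iter=3):
    """Generic sketch of an iterative screen: at each round, rank the
    remaining covariates by a marginal utility, absorb the top `step`,
    then residualize the remaining covariates on the selected block to
    weaken spurious correlations with already-selected covariates."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    Xr = X.astype(float).copy()
    for _ in range(n_iter):
        scores = np.array([utility(Xr[:, j], y) for j in remaining])
        top = [remaining[i] for i in np.argsort(scores)[::-1][:step]]
        selected.extend(top)
        remaining = [j for j in remaining if j not in top]
        if not remaining:
            break
        # project remaining covariates off the selected block (residual matrix)
        Xs = X[:, selected]
        coef, *_ = np.linalg.lstsq(Xs, X[:, remaining], rcond=None)
        Xr[:, remaining] = X[:, remaining] - Xs @ coef
    return selected
```

The extra least-squares projections at each round are precisely the added computational burden mentioned above, which grows with the number of iterations and selected covariates.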

We foresee several additional avenues for future work. One is combining different screening approaches. For instance, consider a composite marginal utility of the form ω(τ) = τ·ω_{S(1)} + (1 − τ)·ω_{S(2)}, where S(1) and S(2) indicate two different screens and τ ∈ [0, 1] is a weighting parameter. ω(τ), especially with an appropriate data-driven tuning of τ, could combine the strengths of different approaches. As another instance, consider the selection of the threshold d used to separate active and inactive covariates (both soft and hard thresholding rules are discussed in the literature; see Fan and Fan (2008); Zhu et al. (2011); Li et al. (2012); Shao and Zhang (2014)). Let c(d) = |A_d^{S(1)} ∩ A_d^{S(2)}| be the cardinality of the intersection of the active sets estimated by two screens using threshold d. By construction, a plot of c(d) vs. d, d = 1, 2, …, can be used as a visual diagnostic to identify a d* where c(d*) comes very close to d*, that is, a threshold that guarantees high congruence between the screens. Both the composite marginal utility and the threshold diagnostic plot, of course, could potentially combine more than two screens.
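The composite utility and the threshold diagnostic described above can be sketched as follows. The min–max rescaling inside `composite_utility` is our own assumption, made so that neither screen dominates the mixture purely by scale:

```python
import numpy as np

def composite_utility(w1, w2, tau):
    """Composite utility ω(τ) = τ·ω_{S(1)} + (1 − τ)·ω_{S(2)}.
    Each utility vector is rescaled to [0, 1] first (our assumption)."""
    r1 = (w1 - w1.min()) / (np.ptp(w1) or 1.0)
    r2 = (w2 - w2.min()) / (np.ptp(w2) or 1.0)
    return tau * r1 + (1.0 - tau) * r2

def congruence_curve(scores1, scores2, d_max):
    """Threshold diagnostic c(d) = |A_d^{S(1)} ∩ A_d^{S(2)}|: the overlap
    between the top-d covariate sets of two screens, for d = 1..d_max.
    A d* with c(d*) close to d* indicates high congruence."""
    rank1 = np.argsort(scores1)[::-1]
    rank2 = np.argsort(scores2)[::-1]
    return [len(set(rank1[:d]) & set(rank2[:d])) for d in range(1, d_max + 1)]
```

Plotting the returned curve against d then gives the visual diagnostic described in the text; when the two screens agree perfectly, c(d) = d for every d.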

Another interesting future avenue is investigating the performance of CIS in the rare and weak signal regimes often encountered in Genome Wide Association Studies (GWAS), and in cases where the assumption that zero low-order marginal correlations imply zero higher-order partial correlations (also known as the “faithfulness” condition; Genovese et al. (2012)) is violated due to factors such as signal cancellation (Wasserman and Roeder, 2009). Procedures such as Covariance Assisted Screening and Estimation (Ke et al., 2014) and graphlet screening (Jin et al., 2014) address these issues.

We mentioned (Section 1) and demonstrated via numerical examples (Sections 3.2 and S3) that CIS can also be used to screen discrete or categorical predictors – as long as the response is continuous. Another important avenue for future work is the extension of CIS to cases where both the response and the covariates are discrete or categorical, as well as to cases where the response is multivariate. Developments in the former (e.g., Huang et al. (2014); Cui et al. (2015)) and the latter (e.g., Zhu et al. (2011); Li et al. (2012); Shao and Zhang (2014)) directions already exist in the feature screening literature.

5. Supplementary material and codes

Proofs of theoretical results, full simulation results, details on the transcriptomic data application, and some relevant additional information are provided in an online Supplement. MATLAB (MATLAB, 2020) and R (R Core Team, 2020) source functions implementing CIS and other feature screening procedures, code for the numerical examples in the simulation study, and code for the analyses of the transcriptomic data are publicly available at the following link: bit.ly/CIS-Codes.

Acknowledgments

F. Chiaromonte and D. Nandy were supported by NSF grant DMS-1407639. R. Li was supported by NSF grants DMS-1820702, DMS-1953196 and DMS-2015539, and NIH grants R01CA229542, R01ES019672 and R21CA226300.

We thank Drs. Bharath Sriperumbudur, Amal Agarwal, and Mauricio Nascimento for helping with theoretical derivations; Dr. Weixin Yao for MATLAB codes to compute Covariate Information Matrices; Dr. Paolo Inglese for MATLAB code to compute distance correlations; and Drs. Xiaofeng Shao and Jingsi Zhang for R code to compute martingale difference correlations, the transcriptomic data, and R codes for its preprocessing. We also thank members of the Makova Lab at Penn State and Binglan (Victoria) Li for helping with the transcriptomic data application. Finally, we are grateful to the anonymous reviewers and the Associate Editor for crucial feedback that helped us greatly improve our work.

References

  1. Barghash A, Arslan T, and Helms V (2016). Robust detection of outlier samples and genes in expression datasets. Journal of Proteomics and Bioinformatics, 9:38–48. [Google Scholar]
  2. Bickel PJ and Levina E (2008). Regularized estimation of large covariance matrices. Annals of Statistics, 36(1):199–227. [Google Scholar]
  3. Chen Y, Chi Y, and Goldsmith AJ (2015). Exact and stable covariance estimation from quadratic sampling via convex programming. IEEE Transactions on Information Theory, 61(7):4034–4059. [Google Scholar]
  4. Chen Z, Fan J, and Li R (2018). Error variance estimation in ultrahigh-dimensional additive models. Journal of the American Statistical Association, 113(521):315–327. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Chiang AP, Beck JS, Yen H-J, Tayeh MK, Scheetz TE, Swiderski RE, Nishimura DY, Braun TA, Kim K-YA, Huang J, Elbedour K, Carmi R, Slusarski DC, Casavant TL, Stone EM, and Sheffield VC (2006). Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a Bardet–Biedl syndrome gene (BBS11). Proceedings of the National Academy of Sciences, 103(16):6287–6292. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Chouldechova A and Hastie T (2015). Generalized additive model selection. arXiv preprint arXiv:1506.03850. [Google Scholar]
  7. Chouldechova A, Hastie T, and Spinu V (2018). gamsel: Fit regularization path for generalized additive models. R package version, 1(1). [Google Scholar]
  8. Cook RD and Weisberg S (1991). Comment. Journal of the American Statistical Association, 86(414):328–332. [Google Scholar]
  9. Cui H, Li R, and Zhong W (2015). Model-free feature screening for ultrahigh dimensional discriminant analysis. Journal of the American Statistical Association, 110(510):630–641. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Fan J and Fan Y (2008). High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6):2605. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Fan J, Fan Y, and Wu Y (2011a). High-dimensional classification. In Cai T and Shen X, editors, High-Dimensional Data Analysis, pages 3–37. World Scientific, Singapore. [Google Scholar]
  12. Fan J, Feng Y, and Song R (2011b). Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association, 106(494):544–557. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360. [Google Scholar]
  14. Fan J and Lv J (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849–911. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Fan J and Lv J (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1):101–148. [PMC free article] [PubMed] [Google Scholar]
  16. Fan J, Samworth R, and Wu Y (2009). Ultrahigh dimensional feature selection: beyond the linear model. Journal of Machine Learning Research, 10(Sep):2013–2038. [PMC free article] [PubMed] [Google Scholar]
  17. Friedman JS, Ray JW, Waseem N, Johnson K, Brooks MJ, Hugosson T, Breuer D, Branham KE, Krauth DS, Bowne SJ, et al. (2009). Mutations in a BTB-Kelch protein, KLHL7, cause autosomal-dominant retinitis pigmentosa. The American Journal of Human Genetics, 84(6):792–800. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Genovese CR, Jin J, Wasserman L, and Yao Z (2012). A comparison of the lasso and marginal regression. The Journal of Machine Learning Research, 13(1):2107–2143. [Google Scholar]
  19. Huang D, Li R, and Wang H (2014). Feature screening for ultrahigh dimensional categorical data with applications. Journal of Business & Economic Statistics, 32(2):237–244. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Huang J, Horowitz JL, and Wei F (2010). Variable selection in nonparametric additive models. Annals of Statistics, 38(4):2282–2313. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Hui G and Lindsay BG (2010). Projection pursuit via white noise matrices. Sankhya B, 72(2):123–153. [Google Scholar]
  22. Ji P and Jin J (2012). UPS delivers optimal phase diagram in high-dimensional variable selection. Annals of Statistics, 40(1):73–103. [Google Scholar]
  23. Jin J, Zhang C-H, and Zhang Q (2014). Optimality of graphlet screening in high dimensional variable selection. The Journal of Machine Learning Research, 15(1):2723–2772. [Google Scholar]
  24. Ke T, Jin J, and Fan J (2014). Covariance assisted screening and estimation. Annals of Statistics, 42(6):2202. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Ledoit O and Wolf M (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365–411. [Google Scholar]
  27. Li K-C (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–327. [Google Scholar]
  28. Li R, Zhong W, and Zhu L (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association, 107(499):1129–1139. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Lindsay BG and Yao W (2012). Fisher information matrix: A tool for dimension reduction, projection pursuit, independent component analysis, and more. Canadian Journal of Statistics, 40(4):712–730. [Google Scholar]
  30. Liu J, Li R, and Wu R (2014). Feature selection for varying coefficient models with ultrahigh-dimensional covariates. Journal of the American Statistical Association, 109(505):266–274. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Liu J, Zhong W, and Li R (2015). A selective overview of feature screening for ultrahigh-dimensional data. Science China Mathematics, 58(10):1–22. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. MATLAB (2020). version 9.9.0.1467703 (R2020b). The MathWorks Inc., Natick, Massachusetts. [Google Scholar]
  33. R Core Team (2020). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. [Google Scholar]
  34. Schäfer J and Strimmer K (2005). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4(1):1175–1189. [DOI] [PubMed] [Google Scholar]
  35. Scheetz TE, Kim K-YA, Swiderski RE, Philp AR, Braun TA, Knudtson KL, Dorrance AM, DiBona GF, Huang J, Casavant TL, Sheffield VC, and Stone EM (2006). Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences, 103(39):14429–14434. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Shao X and Zhang J (2014). Martingale difference correlation and its use in high-dimensional variable screening. Journal of the American Statistical Association, 109(507):1302–1318. [Google Scholar]
  37. Silverman BW (2018). Density estimation for statistics and data analysis. Routledge. [Google Scholar]
  38. Székely GJ, Rizzo ML, and Bakirov NK (2007). Measuring and testing dependence by correlation of distances. Annals of Statistics, 35(6):2769–2794. [Google Scholar]
  39. Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288. [Google Scholar]
  40. Wang H and Xia Y (2008). Sliced regression for dimension reduction. Journal of the American Statistical Association, 103(482):811–821. [Google Scholar]
  41. Wang L, Wu Y, and Li R (2012). Quantile regression for analyzing heterogeneity in ultra-high dimension. Journal of the American Statistical Association, 107(497):214–222. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Wang X and Leng C (2016). High dimensional ordinary least squares projection for screening variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(3):589–611. [Google Scholar]
  43. Wasserman L and Roeder K (2009). High dimensional variable selection. Annals of Statistics, 37(5A):2178–2201. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Wen Y, Locke KG, Klein M, Bowne SJ, Sullivan LS, Ray JW, Daiger SP, Birch DG, and Hughbanks-Wheaton DK (2011). Phenotypic characterization of 3 families with autosomal dominant retinitis pigmentosa due to mutations in KLHL7. Archives of Ophthalmology, 129(11):1475–1482. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Yao W, Nandy D, Lindsay BG, and Chiaromonte F (2019). Covariate information matrix for sufficient dimension reduction. Journal of the American Statistical Association, 114(528):1752–1764. [Google Scholar]
  46. Zhang C-H (2010). Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38(2):894–942. [Google Scholar]
  47. Zhong W and Zhu L (2015). An iterative approach to distance correlation-based sure independence screening. Journal of Statistical Computation and Simulation, 85(11):2331–2345. [Google Scholar]
  48. Zhu L-P, Li L, Li R, and Zhu L-X (2011). Model-free feature screening for ultrahigh-dimensional data. Journal of the American Statistical Association, 106(496):1464–1475. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Zou H and Hastie T (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320. [Google Scholar]
