Abstract
We study a marginal empirical likelihood approach in scenarios where the number of variables grows exponentially with the sample size. The marginal empirical likelihood ratios as functions of the parameters of interest are systematically examined, and we find that the marginal empirical likelihood ratio evaluated at zero can be used to differentiate whether an explanatory variable contributes to a response variable or not. Based on this finding, we propose a unified feature screening procedure for linear models and generalized linear models. Different from most existing feature screening approaches, which rely on the magnitudes of some marginal estimators to identify true signals, the proposed screening approach is capable of further incorporating the level of uncertainties of such estimators. This merit is inherited from the self-studentization property of the empirical likelihood approach, and it extends the insights of existing feature screening methods. Moreover, we show that our screening approach relies on less restrictive distributional assumptions and can be conveniently adapted to a broad range of scenarios, such as models specified using general moment conditions. Our theoretical results and extensive numerical examples based on simulations and data analysis demonstrate the merits of the marginal empirical likelihood approach.
Key words and phrases: Empirical likelihood, high-dimensional data analysis, sure independence screening, large deviation
1. Introduction
High-dimensional data are frequently encountered in current practical problems in finance, biomedical sciences, geological studies and many other areas. Statistical methods for high-dimensional data analysis have received increasing interest as tools to deal with large volumes of data containing considerably many features; see Bühlmann and van de Geer (2011), Hastie, Tibshirani and Friedman (2009) and Fan and Lv (2010) for overviews. A fundamental objective of statistical analysis with high-dimensional data is to identify relevant features, so that effective models can subsequently be constructed and applied to solve practical problems.
Recently, independence feature screening methods have been considered; see, for example, Fan and Lv (2008), Fan and Song (2010) and Fan, Feng and Song (2011) for linear models, generalized linear models and nonparametric additive models, respectively. Fan and Lv (2008) and Fan and Song (2010) performed screening by ranking the absolute values of marginal estimates of model coefficients, and Fan, Feng and Song (2011) carried out screening by ranking integrated squared marginal nonparametric curve estimates. Fan and Song (2010) also discussed independence screening by examining the magnitudes of likelihood ratios. More recently, Wang (2012) considered sure independence screening by a factor profiling approach, and Xue and Zou (2011) studied sure independence screening and sparse signal recovery; see also Zhu et al. (2011) and Li, Zhong and Zhu (2012) for recent developments using model-free approaches for feature screening, Li et al. (2012) for a robust rank correlation based approach, and Zhao and Li (2012) for an estimating equation based feature screening approach.
The empirical likelihood approach [Owen (1988, 2001)] has been demonstrated to be effective for statistical inference in scenarios with less restrictive distributional assumptions; see Qin and Lawless (1994), Newey and Smith (2004) and references therein. We refer to Chen and Van Keilegom (2009) for a review and discussion of recent developments in the empirical likelihood approach. The scope of the empirical likelihood approach has also recently been extended to deal with high-dimensional data; see Hjort, McKeague and Van Keilegom (2009), Chen, Peng and Qin (2009), Tang and Leng (2010), Leng and Tang (2012), and Chang, Chen and Chen (2013). Although demonstrated to be effective for statistical inference, the empirical likelihood approach encounters substantial difficulty when the data dimensionality is high. More specifically, the data dimensionality p cannot exceed the sample size n in the conventional empirical likelihood construction. In addition, p can grow at most as $o(n^{1/2})$, or even more slowly, in the settings under which asymptotic properties have been established [Chang, Chen and Chen (2013), Chen, Peng and Qin (2009), Hjort, McKeague and Van Keilegom (2009), Leng and Tang (2012), Tang and Leng (2010)]. Therefore, to apply the empirical likelihood approach more effectively in practice, a pre-screening procedure is necessary to reduce the set of candidate features.
In this study, we systematically examine the properties of a marginal empirical likelihood approach in which the available features are assessed individually, one at a time. The marginal empirical likelihood approach involves only univariate optimizations, so it provides a convenient device for both theoretical analysis and practical implementation. Our analysis reveals the probabilistic behavior of the marginal empirical likelihood ratios as functions of the parameters of interest evaluated at arbitrary values, which is itself a problem of independent interest because existing studies of the empirical likelihood approach generally focus on its properties when evaluated at the truth, or at values in a small neighborhood of the truth. Based on this finding, we propose to conduct feature screening by using the marginal empirical likelihood ratio evaluated at zero. We find that a unified screening procedure can be applied in both linear models and generalized linear models. We also demonstrate how the marginal empirical likelihood approach can be conveniently adapted to solve a broad range of problems for models specified by general moment conditions. Hence, the marginal empirical likelihood approach provides a general and adaptive procedure for solving a broad class of practical feature screening problems. Our theoretical analysis shows that the proposed screening procedure based on the marginal empirical likelihood approach is selection consistent—that is, able to identify the features that contribute to the response variable—when the number of explanatory variables p grows exponentially with the sample size n.
Our study contributes to sure independence feature screening for high-dimensional data analysis in two substantial aspects. First of all, a fundamental difference between our approach and existing approaches is that the marginal empirical likelihood ratio statistic is a self-studentized quantity [Owen (2001)], whereas other existing screening methods generally rely on ranking features by the magnitudes of some marginal estimators. Therefore, our approach is able to additionally incorporate the level of uncertainties associated with those estimators when conducting feature screening. This clearly extends the scope of existing feature screening approaches by considering more aspects of marginal statistical procedures. We show in our simulation studies that when heterogeneity exists in the conditional variance, our approach performs much better than a least squares-based approach. Second, our screening procedure inherits the nonparametric merits of the empirical likelihood approach. Specifically, our approach requires no strict distributional assumptions such as normally distributed errors in linear models, or an exponential family distributed response in generalized linear models. This broadens the scope and applicability of our approach. As a result, we show that the marginal empirical likelihood approach provides a unified framework for feature screening in linear regression models and generalized linear models, and can be conveniently applied to a broad class of general problems.
The rest of this paper is organized as follows. We elaborate the method of the marginal empirical likelihood approach in Section 2. Properties of the proposed approach are given in Section 3. Section 4 extends the marginal empirical likelihood approach to a broad framework including models specified by general moment conditions, and presents an iterative sure screening procedure using profile empirical likelihood. Numerical examples are given in Section 5. We conclude with some discussions in Section 6. All technical details are contained in the supplementary material of this paper [Chang, Tang and Wu (2013)].
2. Methodology
2.1. Marginal empirical likelihood for linear models
Let us motivate the marginal empirical likelihood approach by first considering the multiple linear regression model
(2.1) $Y = X^{T}\beta + \varepsilon,$
where X = (X1, …, Xp)^T is the vector of explanatory variables, ε is the random error with zero mean, and β = (β1, …, βp)^T is the vector of unknown parameters. Hereinafter, we also use β to denote the truth of the parameter whenever no confusion arises. Without loss of generality, we assume hereinafter that the explanatory variables are standardized such that $\mathbb{E}(X_j) = 0$ and $\mathbb{E}(X_j^2) = 1$ (j = 1, …, p). For effective and interpretable practical applications, one may reasonably expect that among the large number of explanatory variables, only a small fraction contribute to the response variable. We therefore denote by $\mathcal{M}_* = \{1 \le j \le p : \beta_j \ne 0\}$ the collection of the effective explanatory variables in the true sparse model, whose size is characterized by its cardinality $s = |\mathcal{M}_*|$. Here we assume that s is much smaller than p, reflecting the situation in many practical applications such as finance, biology and clinical studies.
In the recent literature of high-dimensional data analysis, various marginal approaches have been applied for locating the true model $\mathcal{M}_*$; see, for example, Fan and Lv (2008), Fan and Song (2010) and Fan, Feng and Song (2011). Among those approaches, a popular way is to assess the marginal contribution from a given explanatory variable Xj. Commonly applied criteria for measuring the marginal contribution are the magnitudes of some marginal estimators [Fan, Feng and Song (2011), Fan and Lv (2008), Fan and Song (2010)]. Subsequently, the candidate models are chosen from the top ranked explanatory variables.
To apply a marginal empirical likelihood approach for the linear regression model (2.1), let us consider the marginal moment condition of the least squares estimator:
(2.2) $\mathbb{E}\{X_j(Y - \beta_j^{M}X_j)\} = 0,$
where $\beta_j^{M}$ is interpreted as the marginal contribution of the covariate Xj to Y. From (2.2) we can see that $\beta_j^{M}$ is the covariance between Xj and Y, so that $\beta_j^{M} = 0$ is equivalent to Y and Xj being marginally uncorrelated. Here we note the remarkable difference between $\beta_j^{M}$ and βj, where the latter is the truth of the parameter in (2.1): in general, $\beta_j^{M} \ne \beta_j$ unless $\mathbb{E}(X_iX_j) = 0$ for all i ≠ j. In addition to the contribution from βj in the model (2.1), $\beta_j^{M}$ also contains the aggregated contributions from other components that may be correlated with Xj. Thus, the correlation level among covariates has a significant impact on the performance of a screening procedure based on (2.2); more discussions on this are given in a later section containing the main results.
A marginal empirical likelihood for linear models can be constructed as follows. Note that $\mathbb{E}(X_j^2) = 1$; therefore (2.2) is equivalent to
(2.3) $\mathbb{E}(X_jY - \beta_j^{M}) = 0.$
Let $\{(X_i, Y_i)\}_{i=1}^n$ be the collected independent data and let $g_{ij}(\beta) = X_{ij}Y_i - \beta$ (j = 1, …, p), where Xij denotes the jth component of the ith observation Xi. Based on (2.3), we define the following marginal empirical likelihood:
(2.4) $EL_j(\beta) = \max\Bigl\{\prod_{i=1}^n w_i : w_i \ge 0,\ \sum_{i=1}^n w_i = 1,\ \sum_{i=1}^n w_ig_{ij}(\beta) = 0\Bigr\}$
for j = 1, …, p. For any given β in the convex hull of $\{X_{ij}Y_i\}_{i=1}^n$, the marginal empirical likelihood ratio is defined as
(2.5) $\ell_j(\beta) = -2\log\{n^nEL_j(\beta)\} = 2\sum_{i=1}^n\log\{1 + \lambda g_{ij}(\beta)\},$
where λ is the Lagrange multiplier satisfying
(2.6) $\frac{1}{n}\sum_{i=1}^n\frac{g_{ij}(\beta)}{1 + \lambda g_{ij}(\beta)} = 0.$
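To make the computations in (2.4)–(2.6) concrete, the following is a minimal sketch in Python (our own illustration; the function names are hypothetical and not from the paper). Because each $g_{ij}(\beta)$ is scalar, the Lagrange multiplier in (2.6) can be found by a univariate root search over the interval on which all implied empirical likelihood weights stay positive.

```python
import numpy as np
from scipy.optimize import brentq

def marginal_el_ratio(g):
    """Empirical likelihood ratio 2 * sum_i log(1 + lam * g_i) for testing
    that the scalar observations g_1, ..., g_n have mean zero."""
    g = np.asarray(g, dtype=float)
    if g.min() >= 0.0 or g.max() <= 0.0:
        return np.inf  # 0 lies outside the convex hull of {g_i}
    # (2.6): mean of g_i / (1 + lam * g_i) = 0.  Positive weights require
    # lam in (-1/max(g), -1/min(g)); the equation is monotone on this interval.
    lo = -(1.0 - 1e-10) / g.max()
    hi = -(1.0 - 1e-10) / g.min()
    lam = brentq(lambda t: np.mean(g / (1.0 + t * g)), lo, hi)
    return 2.0 * np.sum(np.log1p(lam * g))

def ell_j_at_zero(X, Y):
    """l_j(0) for each column of X, using g_ij(0) = X_ij * Y_i."""
    return np.array([marginal_el_ratio(X[:, j] * Y) for j in range(X.shape[1])])
```

The univariate search makes the cost of one marginal evaluation negligible, so all p statistics can be computed by a single pass over the columns.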
2.2. Extended coverage to generalized linear models
A merit of the marginal empirical likelihood approach is that the formulation via (2.4) and (2.5) only requires the moment condition (2.3), rather than a specific distributional assumption on ε in model (2.1). This endows our approach with robustness against violations of distributional model assumptions, and thus it can be extended and adapted to a broader framework. We now elaborate on how the above marginal empirical likelihood approach can be applied equally when the response variable Y belongs to the exponential family with density function taking the canonical form [McCullagh and Nelder (1989)]:
(2.7) $f(y; \theta) = \exp\{y\theta - b(\theta) + c(y)\}$
for some suitable known functions b(·), c(·) and canonical parameter θ. Further extensions of the marginal empirical likelihood approach are discussed in a later section. We refer to Kolaczyk (1994) and Chen and Cui (2003) for conventional applications of the empirical likelihood to generalized linear models. Following the convention of generalized linear models, we denote the mean function by $\mu = \mathbb{E}(Y|X) = b'(\theta)$, where θ is modeled by a linear function $\beta_0 + X^{T}\beta$ with β = (β1, …, βp)^T, and use V(μ) to denote the variance of Y expressed as a function of μ.
For any j = 1, …, p, the moment condition based on the marginal likelihood approach in Fan and Song (2010) for βj is
(2.8) $\mathbb{E}[X_j\{Y - b'(\beta_{j,0}^{M} + \beta_j^{M}X_j)\}] = 0,$
where $b'(\beta_{j,0}^{M} + \beta_j^{M}X_j)$ is the implied mean function that is modeled marginally using only Xj. Here $\beta_j^{M}$ is again interpreted as the marginal contribution of Xj to the response variable Y; see also Fan and Song (2010). By the property of the exponential family distribution, $\mathbb{E}(Y|X) = b'(\theta)$ and V(μ) = b″(θ). Then (2.8) becomes
(2.9) $\mathbb{E}[X_j\{b'(\theta) - b'(\beta_{j,0}^{M} + \beta_j^{M}X_j)\}] = 0.$
For linear models, $b'(\theta) = \theta$, so that (2.9) becomes $\mathbb{E}\{X_j(Y - \beta_j^{M}X_j)\} = 0$ by noting that the intercepts vanish in the linear model case, which is exactly (2.2). Hence, (2.9) is a natural extension of (2.2) to generalized linear models.
One way to apply the marginal approach would be to generalize the definition in (2.4) to $g_{ij}(\beta) = X_{ij}\{Y_i - b'(\beta_0 + \beta X_{ij})\}$ (j = 1, …, p). However, such a modification is actually not necessary. To see this, we note that when the marginal contribution $\beta_j^{M} = 0$, the marginal moment condition (2.9) becomes $\mathbb{E}[X_j\{Y - b'(\beta_{j,0}^{M})\}] = 0$. Hence, it implies that the covariance between Xj and Y is 0, which is exactly the implication of (2.3) in the linear models. From this perspective, (2.9) and (2.3) are essentially equivalent. Additionally, the response variable can in practice always be centered to have zero mean. This fact eliminates the concern about the intercept in the generalized linear models when considering a marginal empirical likelihood approach. As a result, we conclude that the unified marginal empirical likelihood construction (2.4) with the same $g_{ij}(\beta) = X_{ij}Y_i - \beta$ can be applied equally to linear models and to generalized linear models with a centered response variable Y. The implication of this unified construction is also intuitively clear when interpreting β as the covariance between a covariate and the response variable.
Furthermore, we note that the distributional assumption (2.7) is actually not required in our marginal empirical likelihood approach. Therefore our approach is not restricted to the exponential family (2.7). Since we only require the marginal moment condition (2.9), our approach can be applied with the quasi-likelihood approach and it also works with misspecified variance functions [McCullagh and Nelder (1989)].
The marginal empirical likelihood ratio (2.5) with $g_{ij}(\beta) = X_{ij}Y_i - \beta$ evaluated at β = 0—that is, ℓj(0)—has a very clear practical interpretation by noting that it can be used to test the null hypothesis $H_0: \mathbb{E}(X_jY) = 0$. By noting additionally the intuitively clear fact that ℓj(0) should not be large if $\mathbb{E}(X_jY) = 0$, we can see that ℓj(0) can be used as a device for feature screening. More specifically, we have the following procedure:
Step 1: Evaluate ℓj(0) for all j = 1, …, p, where ℓj(·) is defined in (2.5) with $g_{ij}(\beta) = X_{ij}Y_i - \beta$. If 0 is not in the convex hull of $\{X_{ij}Y_i\}_{i=1}^n$, we define ℓj(0) = ∞, regarding this as strong evidence of significance in predicting Y using Xj.
Step 2: Given a threshold level γn, select the set of variables
$\widehat{\mathcal{M}} = \{1 \le j \le p : \ell_j(0) \ge \gamma_n\}.$
We specify in the next section the requirement on γn so that the screening procedure is consistent. In practice, however, explicitly identifying γn is generally difficult because it involves unknown constants. Thus, the screening procedure can be implemented in practice so that $\widehat{\mathcal{M}}$ recruits candidate features until a certain size, such as $n^{1/2}$, is reached.
We remark that the evaluation of ℓj(β) in (2.5) is actually very easy in practice because all of the optimizations involved are univariate, which is very convenient for practical applications. Moreover, our procedure only needs to evaluate the marginal empirical likelihood ratio (2.5) at β = 0, and it avoids estimating $\beta_j^{M}$ when conducting the feature screening.
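A full screening pass then only requires ranking the p univariate statistics. The sketch below continues the snippet in Section 2.1 (our own illustration; the default size rule mirrors the one used in our numerical examples in Section 5 and is not a prescription):

```python
def el_sis(X, Y, d=None):
    """Steps 1-2: rank features by l_j(0) and keep the d top-ranked ones.
    Columns of X are assumed standardized; Y is centered internally."""
    n = X.shape[0]
    ell0 = ell_j_at_zero(X, Y - Y.mean())
    if d is None:
        d = int(n // (2 * np.log(n)))  # size rule used in Section 5
    order = np.argsort(-ell0)          # descending; l_j(0) = inf ranks first
    return order[:d], ell0
```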
3. Main results
Now we present the main results for the marginal empirical likelihood ratio in (2.5) with the unified specification $g_{ij}(\beta) = X_{ij}Y_i - \beta$, which are generally applicable to both linear models and generalized linear models. In our discussion hereinafter, let $\rho_j = \mathbb{E}(X_jY)$. If ρj = 0, it is well known that ℓj(0) is asymptotically chi-square distributed with 1 degree of freedom [Owen (1988, 2001)]. If ρj ≠ 0, however, the properties of ℓj(0) are generally less clear, which is also a question of independent interest. Specifically, if $\beta = \rho_j + \tau\sigma n^{-1/2}$ where $\sigma^2 = \mathrm{var}(X_jY)$, it can be shown following the same argument of Owen (1988) that ℓj(β) converges in distribution, as n → ∞ and under some regularity conditions, to a noncentral chi-square distribution with 1 degree of freedom and noncentrality parameter τ². But if β − ρj converges to zero at a rate slower than $n^{-1/2}$, the exact diverging rate of ℓj(β) is less clear in the existing literature.
We first present a general result showing that the empirical likelihood ratio ℓj(β) is no longer $O_p(1)$ when β − ρj converges to 0 but $n^{1/2}(\beta - \rho_j)$ diverges.
Proposition 1
Suppose that U1, …, Un are independent and identically distributed random variables with $\mathbb{E}(|U_i|^{\nu}) < \infty$ for some ν ≥ 3. Replacing gij(β) in (2.5) and (2.6) by Ui − μ for all i = 1, …, n, we obtain ℓ(μ). If $|\mu - \mu_0| = O(n^{-w})$ for some $0 < w < 1/2$, then
$\ell(\mu) = \sigma^{-2}n(\mu - \mu_0)^2\{1 + o_p(1)\},$
where $\mu_0 = \mathbb{E}(U_i)$ and $\sigma^2 = \mathbb{E}\{(U_i - \mu_0)^2\}$.
We note that Chen, Gao and Tang (2008) contains a related result showing that the empirical likelihood ratio diverges when evaluated at values far enough from the truth. Our Proposition 1 gives the specific diverging rate of the empirical likelihood ratio. Proposition 1 implies that if β − ρj converges to zero at a rate slower than $n^{-1/2}$, then $\ell_j(\beta) = O_p\{n(\beta - \rho_j)^2\}$. If β − ρj does not converge to zero at all, our Theorem 1 presented later shows that ℓj(β) takes large values with high probability. Moreover, as clearly shown in our proof of Proposition 1 given in Chang, Tang and Wu (2013), the statistic ℓj(0) is self-studentized, and hence it incorporates the level of uncertainties from using the finite sample moment conditions. Such a feature is desirable because, in practice, the levels of uncertainties of different covariates when contributing to the response variable of interest can differ. This may confound a ranking for feature screening based on marginal estimators themselves without considering their standard errors, not to mention that incorporating the level of uncertainties is difficult, especially when handling high-dimensional statistical problems.
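As an informal numerical check of this diverging rate (our own illustration, continuing the earlier sketch and not part of the paper), one can verify on simulated data that ℓ(μ) tracks $\sigma^{-2}n(\mu - \mu_0)^2$ when $\mu - \mu_0 = n^{-w}$ with w < 1/2:

```python
rng = np.random.default_rng(0)
n = 5000
U = 1.0 + rng.standard_normal(n)  # mu_0 = 1 and sigma^2 = 1
for w in (0.4, 0.3, 0.2):
    mu = 1.0 + n ** (-w)          # deviation from mu_0 of order n^{-w}
    # EL ratio for the mean of U evaluated at mu, versus the predicted rate
    print(w, marginal_el_ratio(U - mu), n ** (1.0 - 2.0 * w))
```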
An effective marginal screening procedure requires two conditions: (i) if j ∈ $\mathcal{M}_*$, then ρj takes a nonnegligible value; and (ii) if j ∉ $\mathcal{M}_*$, then ρj takes a negligible value. The first requirement is closely related to recruiting the true signals that contribute to the response, and the second one affects the size of the selected variable set that may contain false signals. Fan and Lv (2008) show that the first requirement is fulfilled under the identification condition $\min_{j\in\mathcal{M}_*}|\rho_j| \ge f_n > 0$ for some sequence fn. A common assumption is $f_n = Cn^{-\kappa}$ for some $0 \le \kappa < 1/2$.
Our next theoretical analysis imposes the following two assumptions:
- A.1: The random variable Y has bounded variance and there exists a positive constant c1 such that $\min_{j\in\mathcal{M}_*}|\mathbb{E}(X_jY)| \ge c_1n^{-\kappa}$ for some $0 \le \kappa < 1/2$.
- A.2: There are positive constants K1, K2, γ1 and γ2 such that $\mathbb{P}(|X_j| \ge u) \le K_1\exp(-K_2u^{\gamma_1})$ for each j = 1, …, p and any u > 0, and $\mathbb{P}(|Y| \ge u) \le K_1\exp(-K_2u^{\gamma_2})$ for any u > 0.
Assumption A.1 can be viewed as a requirement on the minimal signal strength, and we call it the identification condition for j ∈ $\mathcal{M}_*$. For linear models, assumption A.1 is the same as Condition 3 in Fan and Lv (2008), which is commonly assumed in sure independence feature screening. For generalized linear models, Fan and Song (2010) impose the identification condition $\min_{j\in\mathcal{M}_*}|\mathrm{cov}(b'(X^{T}\beta), X_j)| \ge c_1n^{-\kappa}$. By noticing that $\mathrm{cov}(b'(X^{T}\beta), X_j) = \mathbb{E}(X_jY)$, their identification condition for j ∈ $\mathcal{M}_*$ is also the same as A.1. Since we impose no distributional assumptions, A.2 is assumed to ensure the large deviation results that are used to obtain the exponential convergence rates. The first part of A.2 is the same as the first part of Condition D in Fan and Song (2010). For the linear regression model, the second part of Condition D in Fan and Song (2010) is equivalent to requiring that $X^{T}\beta$ satisfies the Cramér condition, that is, there exists a positive constant H such that $\mathbb{E}\{\exp(tX^{T}\beta)\} < \infty$ for any |t| < H. If the error ε is independent of the covariates and satisfies the Cramér condition, then the variable Y also satisfies the Cramér condition. From Lemma 2.2 in Petrov (1995), a random variable W satisfying the Cramér condition is equivalent to the existence of positive constants b1 and b2 such that ℙ{|W| ≥ u} ≤ b1 exp(−b2u) for any u > 0. Therefore, our assumption here is actually weaker than that in Fan and Song (2010). On the other hand, A.2 is also a general technical assumption in the literature of large deviations. For example, γ1 = 2 if the Xj's follow a normal or sub-Gaussian distribution, and γ1 = ∞ if the Xj's have compact support.
We now establish the following general result for the distribution of the empirical likelihood ratio, which is the foundation for our subsequent theoretical results.
Theorem 1
Suppose that U1, …, Un are independent and identically distributed random variables. Assume that there exist three positive constants K̃1, K̃2 and γ such that $\mathbb{P}\{|U_i| > u\} \le \tilde{K}_1\exp(-\tilde{K}_2u^{\gamma})$ for all u > 0. Define $\mu_0 = \mathbb{E}(U_i)$, $\delta = \min\{\gamma, 1\}$ and $H = 2^{1+\delta}$, where $\sigma^2 = \mathbb{E}\{(U_i - \mu_0)^2\}$ and K > σ is a sufficiently large positive constant depending only on K̃1, K̃2, γ and μ0. Then, for L → ∞, there exists a positive constant C depending only on K̃1, K̃2 and γ such that
where ℓ(μ) is defined in Proposition 1.
The proof of Theorem 1 is given in Chang, Tang and Wu (2013); the main idea is to apply large deviation theory [Petrov (1995), Saulis and Statulevičius (1991)].
Theorem 1 reveals the magnitude of the empirical likelihood ratio statistic evaluated at arbitrary values. When μ −μ0 does not diminish to 0, Theorem 1 implies that the empirical likelihood ratio statistic diverges with large probability where the diverging rate synthetically depends on the sample size n, some diverging L and the deviation of μ from the truth. Here L is a general technical device whose diverging rate is arbitrary. As a direct result of Theorem 1, we have the following proposition for ℓj (0).
Proposition 2
Under assumptions A.1 and A.2, there exists a positive constant C1 depending only on the K1, K2, γ1 and γ2 appearing in assumption A.2 such that, for any j ∈ $\mathcal{M}_*$ and L → ∞,
where $\gamma = \gamma_1\gamma_2/(\gamma_1 + \gamma_2)$ and $\delta = \min\{\gamma, 1\}$.
Proposition 2 is a uniform result for all features contributing to the true model. Specifically, with large probability and uniformly over j ∈ $\mathcal{M}_*$, the diverging rate of ℓj(0) is no slower than $n^{1-2\kappa}L^{-2}$. From Proposition 1, if $|\mathbb{E}(X_jY)| = O(n^{-w})$ for some $0 < w < 1/2$ and some j ∈ $\mathcal{M}_*$, then $\ell_j(0) = O_p(n^{1-2w})$. This can be viewed as a requirement that the signal strength cannot diminish to 0 at too fast a rate. Therefore, $n^{1/2-\kappa}L^{-1} \to \infty$ as n → ∞ is required for sure independence screening. By choosing $L = n^{1/2-\kappa-\tau}$ for some $0 < \tau < 1/2 - \kappa$, we obtain the following corollary summarizing more specifically that the set $\mathcal{M}_*$ can be distinguished by examining the marginal empirical likelihood ratios ℓj(0) (j = 1, …, p).
Corollary 1
Under assumptions A.1 and A.2, there exists a positive constant C1 depending only on the K1, K2, γ1 and γ2 appearing in assumption A.2 such that, for any $0 < \tau < 1/2 - \kappa$,
where $\gamma = \gamma_1\gamma_2/(\gamma_1 + \gamma_2)$ and $\delta = \min\{\gamma, 1\}$.
Summarizing the above results, we formally establish the screening properties of the marginal empirical likelihood approach.
Theorem 2
Under assumptions A.1 and A.2, there exists a positive constant C1 depending only on the K1, K2, γ1 and γ2 appearing in assumption A.2 such that, for any $0 < \tau < 1/2 - \kappa$ and $\gamma_n = cn^{2\tau}$ with a suitable constant c > 0,
where $\gamma = \gamma_1\gamma_2/(\gamma_1 + \gamma_2)$ and $\delta = \min\{\gamma, 1\}$.
Theorem 2 implies the sure screening property of our procedure with nonpolynomial dimensionality. When the covariates and error are normal, γ1 = 2 and γ2 = 2. Then γ = 1, δ = 1 and $\log p = o(n^{1/2-\kappa})$, which is weaker than that in Fan and Lv (2008), where $\log p = o(n^{1-2\kappa})$ is allowed. This can be viewed as a price paid for allowing nonnormal covariates and more general error distributions. Furthermore, we compare our result with that in Fan and Song (2010). Lemma 1 in Fan and Song (2010) implies that γ2 = 1. The corresponding parameters under their setting are $\gamma = \gamma_1/(\gamma_1 + 1)$ and $\delta = \gamma_1/(\gamma_1 + 1)$, respectively. Then we can handle a nonpolynomial dimensionality in this setting that is actually stronger than the result in Fan and Song (2010), where $\log p = o(n^{(1-2\kappa)\gamma_1/A})$ and $A = \max\{\gamma_1 + 4, 3\gamma_1 + 2\}$.
Now we investigate how large the set $\widehat{\mathcal{M}}$ is. This question is closely related to the asymptotic properties of ℓj(0) for j ∉ $\mathcal{M}_*$; essentially, we need to know the magnitudes of ℓj(0) for j ∉ $\mathcal{M}_*$. We first consider the simple case with ρj = 0 for all j ∉ $\mathcal{M}_*$ and have the following result.
Proposition 3
Under assumptions A.1 and A.2, if ρj = 0, then there is a positive constant C2 depending only on the K1, K2, γ1 and γ2 appearing in assumption A.2 such that, for any τ > 0,
where $\gamma = \gamma_1\gamma_2/(\gamma_1 + \gamma_2)$.
The assumption that ρj = 0 for any j ∉ $\mathcal{M}_*$ can be guaranteed by the partial orthogonality condition, that is, {Xj : j ∉ $\mathcal{M}_*$} is independent of {Xj : j ∈ $\mathcal{M}_*$}. The orthogonality condition is essentially the assumption made in Huang, Horowitz and Ma (2008), who showed model selection consistency in the case of the ordinary linear model and bridge regression. This proposition gives the property of ℓj(0) for any j ∉ $\mathcal{M}_*$, which can be used to establish the theoretical result for the size of $\widehat{\mathcal{M}} = \{1 \le j \le p : \ell_j(0) \ge \gamma_n\}$. Note that $\{|\widehat{\mathcal{M}}| > s\} \subset \bigcup_{j\notin\mathcal{M}_*}\{\ell_j(0) \ge \gamma_n\}$,
so that $\mathbb{P}\{|\widehat{\mathcal{M}}| > s\} \le p\max_{j\notin\mathcal{M}_*}\mathbb{P}\{\ell_j(0) \ge \gamma_n\}$.
By Proposition 3, we obtain the following theorem.
Theorem 3
Under assumptions A.1 and A.2, if ρj = 0 for any j ∉ $\mathcal{M}_*$, then there exists a positive constant C2 depending only on the K1, K2, γ1 and γ2 appearing in assumption A.2 such that, for any τ > 0 and $\gamma_n = cn^{2\tau}$ with a suitable constant c > 0,
where $\gamma = \gamma_1\gamma_2/(\gamma_1 + \gamma_2)$.
From Theorem 3, we have $\mathbb{P}\{|\widehat{\mathcal{M}}| > s\} \le p\exp\{-C_2n^{\min\{2\tau,\,\gamma/(\gamma+2)\}}\}$, which means that the event $\{|\widehat{\mathcal{M}}| \le s\}$ occurs with probability approaching 1 if $\log p = o(n^{\min\{2\tau,\,\gamma/(\gamma+2)\}})$. On the other hand, following Theorem 2, we have $\mathbb{P}\{\mathcal{M}_* \subset \widehat{\mathcal{M}}\} \to 1$ provided that $\log p = o(n^{\min\{(1-2\kappa-2\tau)\gamma/2,\,1-2\kappa\}})$. Combining these two results, we obtain that $\mathbb{P}\{\widehat{\mathcal{M}} = \mathcal{M}_*\} \to 1$ as n → ∞ when log p satisfies both rate requirements. This property shows the selection consistency of our procedure. In a more general case without the partial orthogonality condition, we can consider the size of the set $\widehat{\mathcal{M}}$ under the setting $\max_{j\notin\mathcal{M}_*}|\rho_j| = O(n^{-\eta})$ for some η > κ, which is an assumption imposed in Fan and Song (2010).
Proposition 4
Under assumptions A.1 and A.2, if $\max_{j\notin\mathcal{M}_*}|\rho_j| = O(n^{-\eta})$ where η > κ and $\min_{j\notin\mathcal{M}_*}\mathbb{E}(X_j^2Y^2) \ge c_2$ for some c2 > 0, then there exists a positive constant C3 depending only on c2 and the K1, K2, γ1 and γ2 appearing in assumption A.2 such that, for any j ∉ $\mathcal{M}_*$ and τ > 0,
where $\gamma = \gamma_1\gamma_2/(\gamma_1 + \gamma_2)$.
If ρj = 0 for any j ∉ $\mathcal{M}_*$, then η = ∞ and this proposition reduces to Proposition 3. Following the same argument as between Proposition 3 and Theorem 3, we can obtain the following theorem on the size of $\widehat{\mathcal{M}}$.
Theorem 4
Under assumptions A.1 and A.2, if $\max_{j\notin\mathcal{M}_*}|\rho_j| = O(n^{-\eta})$ where η > κ and $\min_{j\notin\mathcal{M}_*}\mathbb{E}(X_j^2Y^2) \ge c_2$ for some c2 > 0, then there exists a positive constant C3 depending only on c2 and the K1, K2, γ1 and γ2 appearing in assumption A.2 such that, for any τ > 0 and $\gamma_n = cn^{2\tau}$ with a suitable constant c > 0,
where $\gamma = \gamma_1\gamma_2/(\gamma_1 + \gamma_2)$.
In summary, our results show that the marginal empirical likelihood approach has very good control of the size of the set of recruited variables. With large probability, the set of recruited variables is no larger than the set of the true contributing explanatory variables. As shown later in our simulation results, the marginal empirical likelihood approach performs very well in terms of limiting falsely selected variables.
4. Extensions
4.1. A broad framework
The marginal empirical likelihood can be applied in a general framework beyond linear models and generalized linear models. Based on the general estimating equations approach [Hansen (1982), Qin and Lawless (1994)], we can likewise apply the screening procedure based on the marginal empirical likelihood. We will demonstrate that the marginal empirical likelihood approach provides an effective device for combining information that can be used to enhance the performance of a screening procedure.
Let Zi ∈ ℝ^d (i = 1, …, n) be generic observations, β = (β1, …, βp)^T ∈ ℝ^p be the parameter of interest and g(Z; β) = (g1(Z; β), …, gr(Z; β))^T be an r-dimensional estimating function such that $\mathbb{E}\{g(Z;\beta)\} = 0$. Let $\mathcal{M}_* = \{1 \le j \le p : \beta_j \ne 0\}$ be the true model with size $|\mathcal{M}_*| = s$. We are interested in how to construct a sure feature screening procedure to recover $\mathcal{M}_*$ in the general estimating equation setting. To motivate the marginal empirical likelihood approach, let us consider the estimating function evaluated at $\beta^{(j)} = (0, \dots, 0, \beta_j, 0, \dots, 0)^{T}$, the vector whose jth component is βj and whose other components are zero.
In practice, many components in g(Z; β^(j)) do not involve the unknown parameter; see, for example, the estimating function constructed from the least-squares method and our example given later. Therefore, we denote by $g^{(j)}(Z;\beta)$
an rj (rj ≥ 1)-dimensional estimating function collecting the components in g(Z; β^(j)) that depend on the unknown parameter. Usually rj is small and not all components of Z are involved in g^(j)(Z; β). A remarkable advantage of this broad framework is that it provides a device for feature screening using more flexibly constructed conditions, so that additional data information can be incorporated more effectively.
Correspondingly, we define the marginal empirical likelihood for βj as
(4.1) $EL_j(\beta) = \max\Bigl\{\prod_{i=1}^n w_i : w_i \ge 0,\ \sum_{i=1}^n w_i = 1,\ \sum_{i=1}^n w_ig^{(j)}(Z_i;\beta) = 0\Bigr\}.$
Then screening can be done based on the ranking of ELj(0) or equivalently using the corresponding marginal empirical likelihood ratio evaluated at 0—that is, ℓj(0). The steps of the procedure are the same as those described earlier. A concrete example of this scenario is given as follows.
Example (Quadratic inference function (QIF) approach [Qu, Lindsay and Li (2000)])
Longitudinal data arise commonly in biomedical research, with repeated measurements from the same subject or within the same cluster. Let Yit and Xit (i = 1, …, n, t = 1, …, mi) be the response and covariates of the ith subject measured at time t, and let $\mu_{it} = \mathbb{E}(Y_{it}|X_{it})$ be modeled using $X_{it}^{T}\beta$, where β ∈ ℝ^p is the parameter of interest. Incorporating the dependence among the repeated measurements is essential for efficient inference. Liang and Zeger (1986) proposed to estimate β by solving $\sum_{i=1}^n \dot{\mu}_i^{T}V_i^{-1}(Y_i - \mu_i) = 0$. Here, for the ith subject, $Y_i = (Y_{i1}, \dots, Y_{im_i})^{T}$, $\mu_i = (\mu_{i1}, \dots, \mu_{im_i})^{T}$, $\dot{\mu}_i = \partial\mu_i/\partial\beta^{T}$ and $V_i = v_i^{1/2}Rv_i^{1/2}$, where vi is a diagonal matrix of the conditional variances of subject i and R is a working correlation matrix that may depend on some unknown parameter. This approach uses the estimating function $g(Z_i;\beta) = \dot{\mu}_i^{T}V_i^{-1}(Y_i - \mu_i)$ with r = p. More recently, Qu, Lindsay and Li (2000) proposed to model $R^{-1}$ by $\sum_{k=1}^m a_kM_k$, where M1, …, Mm are known matrices and a1, …, am are unknown constants. Then β can be estimated by the quadratic inference functions approach [Qu, Lindsay and Li (2000)] that uses
(4.2) $g(Z_i;\beta) = \bigl(\{\dot{\mu}_i^{T}v_i^{-1/2}M_1v_i^{-1/2}(Y_i - \mu_i)\}^{T}, \dots, \{\dot{\mu}_i^{T}v_i^{-1/2}M_mv_i^{-1/2}(Y_i - \mu_i)\}^{T}\bigr)^{T}.$
This falls into our framework with r > p when m > 1 and with r = p if m = 1. When applying the marginal approach, we note that g^(j)(Z; β) is an m-dimensional estimating function. The marginal screening by empirical likelihood can be conveniently applied in this scenario, and we note that existing independence screening methods cannot be directly applied when m > 1.
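To illustrate, here is a minimal sketch of the marginal QIF screening statistic (our own reconstruction under simplifying assumptions: identity link, unit conditional variances at β = 0 and a centered response; it is not the paper's implementation). The marginal estimating function for covariate j stacks one component per basis matrix, and the resulting multivariate Lagrange multiplier is obtained by minimizing the convex dual of the empirical likelihood problem.

```python
import numpy as np
from scipy.optimize import minimize

def qif_marginal_scores(X, Y, j, basis):
    """g^{(j)}(Z_i; 0) for covariate j: one component X_{i.j}' M_k Y_i per
    basis matrix M_k.  X: (n, m, p) covariates, Y: (n, m) centered responses,
    basis: list of (m, m) matrices.  Returns an (n, K) array, K = len(basis)."""
    Xj = X[:, :, j]
    return np.stack([np.einsum('it,ts,is->i', Xj, M, Y) for M in basis], axis=1)

def el_ratio_at_zero(G):
    """Multivariate EL ratio 2 * sum_i log(1 + lam' g_i) for E(g) = 0, with
    lam minimizing the convex dual -sum_i log(1 + lam' g_i)."""
    n, r = G.shape
    def neg_dual(lam):
        a = 1.0 + G @ lam
        return 1e10 if np.any(a <= 1e-10) else -np.sum(np.log(a))
    lam = minimize(neg_dual, np.zeros(r), method='Nelder-Mead').x
    return 2.0 * np.sum(np.log1p(G @ lam))
```

Screening then ranks el_ratio_at_zero(qif_marginal_scores(X, Y, j, [M1, M2])) over j = 1, …, p, exactly as in the univariate case.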
In concurrent and independent work, Zhao and Li (2012) considered feature screening using estimating functions when r = p. In our notation, their approach is based on g^(j)(Z; 0), the marginal estimating function evaluated at 0, and their screening procedure is based on ranking the absolute value of the sample version of g^(j)(Z; 0) for j = 1, …, p. Our approach is different, as seen from the above marginal empirical likelihood construction. In addition, analogously to the linear and generalized linear model cases, the marginal empirical likelihood constructed from the marginal estimating function is also capable of incorporating the level of uncertainties associated with the finite sample estimating functions.
We now characterize the properties of the screening procedure in the framework of models specified by estimating equations. For any vector a = (a1, …, aq)^T ∈ ℝ^q, we use $\|a\|_\infty = \max_{i=1,\dots,q}|a_i|$ and $\|a\|_2 = (\sum_{i=1}^q a_i^2)^{1/2}$ to denote its L∞ and L2 norms, respectively. To establish the theoretical results, we need the following two assumptions.
- A.3: There exists a positive constant c3 such that $\min_{j\in\mathcal{M}_*}\|\mathbb{E}\{g^{(j)}(Z;0)\}\|_\infty \ge c_3n^{-\kappa}$ for some $0 \le \kappa < 1/2$.
- A.4: There are positive constants K3, K4 and γ3 such that $\mathbb{P}\{\|g^{(j)}(Z;0)\|_\infty \ge u\} \le K_3\exp(-K_4u^{\gamma_3})$ for each j = 1, …, p and any u > 0.
Assumption A.3 is a general identification condition for the set $\mathcal{M}_*$ when considering the broad framework of models specified by general estimating equations. It means that the weakest signals, reflected by $\|\mathbb{E}\{g^{(j)}(Z;0)\}\|_\infty$ (j ∈ $\mathcal{M}_*$), cannot vanish at a rate faster than $n^{-1/2}$. Assumption A.3 is not stringent, and it reduces to A.1 in the special cases of linear models and generalized linear models. A similar assumption is also made in Zhao and Li (2012). Assumption A.4, the counterpart of A.2 in general cases, is required for establishing the exponential inequalities when analyzing large deviations. Zhao and Li (2012) assumed boundedness of all components of g^(j)(Z; 0), which implies A.4.
Theorem 5
Under assumptions A.3 and A.4, there exists a positive constant C4 depending only on the K3, K4 and γ3 appearing in assumption A.4 such that, for any $0 < \tau < 1/2 - \kappa$ and $\gamma_n = cn^{2\tau}$ with a suitable constant c > 0,
where $\widehat{\mathcal{M}} = \{1 \le j \le p : \ell_j(0) \ge \gamma_n\}$.
This theorem is a natural extension of Theorem 2 to the broad framework of models specified by general estimating equations. In the special cases we have considered, for linear models and generalized linear models, g^(j)(Z; 0) = XjY (j = 1, …, p), and γ3 in assumption A.4 equals $\gamma_1\gamma_2/(\gamma_1 + \gamma_2)$, where γ1 and γ2 are specified in A.2.
Let $u_j = \mathbb{E}\{g^{(j)}(Z;0)\}$ for j = 1, …, p. We now consider the size of $\widehat{\mathcal{M}}$ in the setting
$\max_{j\notin\mathcal{M}_*}\|u_j\|_\infty = O(n^{-\eta})$ for some η > κ.
This specification also reduces to those considered in the special cases of linear models and generalized linear models. The counterpart of Theorem 4 for establishing the selection consistency is given as follows.
Theorem 6
Under assumptions A.3 and A.4, if $\max_{j\notin\mathcal{M}_*}\|u_j\|_\infty = O(n^{-\eta})$ where η > κ and $\min_{j\notin\mathcal{M}_*}\lambda_{\min}(\mathbb{E}\{g^{(j)}(Z;0)g^{(j)}(Z;0)^{T}\}) \ge c_4$ for some c4 > 0, where λmin(A) denotes the smallest eigenvalue of A, then there exists a positive constant C5 depending only on c4 and the K3, K4 and γ3 appearing in assumption A.4 such that, for any τ > 0 and $\gamma_n = cn^{2\tau}$ with a suitable constant c > 0,
Combining Theorems 5 and 6, we can see that the screening procedure using the marginal empirical likelihood ratio is valid in a broad framework for identifying the set of effective features.
4.2. Iterative screening procedure
As we can see from the main results, the proposed marginal empirical likelihood screening procedure works ideally when the explanatory variables are independent of each other. To deal with challenging situations with correlated explanatory variables, we propose to use the following iterative sure independence screening procedure.
Step 1: Rank the explanatory variables according to ℓj(0) in (2.5) and select the top ranked explanatory variables with the largest values of ℓj(0) until some desirable number of features is included. Denote the set of selected explanatory variables by $\widehat{\mathcal{A}}_1$.
Step 1′: Apply penalized empirical likelihood [Leng and Tang (2012), Tang and Leng (2010)] to the explanatory variables in $\widehat{\mathcal{A}}_1$ and denote the resulting model by $\widehat{\mathcal{M}}_1$.
Step 2: Let $\widehat{\mathcal{M}}_k \subset \{1, \dots, p\}$ be the selected model at the kth step. At the kth iteration, for each $j \notin \widehat{\mathcal{M}}_k$, construct the empirical likelihood for the covariates in $\widehat{\mathcal{M}}_k$ combined with Xj, and obtain the profile empirical likelihood for the marginal parameter of Xj, evaluated at 0, by maximizing out the parameters associated with $\widehat{\mathcal{M}}_k$. Rank the explanatory variables $j \notin \widehat{\mathcal{M}}_k$ according to these profile empirical likelihood ratios and select the top ranked until some desirable number of features is included. Denote the selected set by $\widehat{\mathcal{A}}_{k+1}$.
Step 2′: Apply penalized empirical likelihood to the explanatory variables in $\widehat{\mathcal{M}}_k \cup \widehat{\mathcal{A}}_{k+1}$ and denote the resulting model by $\widehat{\mathcal{M}}_{k+1}$.
Step 3: Repeat steps 2 and 2′ until either $\widehat{\mathcal{M}}_{k+1} = \widehat{\mathcal{M}}_k$ or the size of $\widehat{\mathcal{M}}_{k+1}$ reaches a pre-specified number.
The above iterative screening procedure incorporates the profile empirical likelihood. The rationale behind it is to capture joint impacts that may be invisible to the marginal screening procedure when correlations exist among the covariates. Our iterative screening procedure shares some features with the analogous procedures in Fan and Lv (2008) and Fan and Song (2010). The iterative procedure using the profile empirical likelihood ratio, however, retains the feature of the marginal empirical likelihood approach of incorporating the level of uncertainties. In addition, we note that the above iterative procedure is generally applicable in the broad framework of Section 4.1.
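For completeness, a skeleton of the iteration is sketched below (our own illustration; the select callback stands in for the penalized empirical likelihood of Steps 1′ and 2′, and for brevity the profile empirical likelihood ranking of Step 2 is replaced by a marginal re-ranking of the remaining features):

```python
def el_isis(X, Y, select, per_step=5, max_size=20):
    """Iterative screening skeleton (Steps 1-3).  `select(X_sub, Y, cols)` is
    a user-supplied sparse selector standing in for penalized EL; it returns
    the kept column indices (drawn from `cols`)."""
    p = X.shape[1]
    Yc = Y - Y.mean()
    cand, _ = el_sis(X, Yc, d=per_step)                       # Step 1
    model = set(select(X[:, cand], Yc, list(cand)))           # Step 1'
    while True:
        rest = [j for j in range(p) if j not in model]
        scores = ell_j_at_zero(X[:, rest], Yc)                # Step 2 (simplified)
        extra = [rest[k] for k in np.argsort(-scores)[:per_step]]
        cols = sorted(model | set(extra))
        new_model = set(select(X[:, cols], Yc, cols))         # Step 2'
        if new_model == model or len(new_model) >= max_size:  # Step 3
            return sorted(new_model)
        model = new_model
```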
5. Numerical examples
In this section, we use five simulation examples and a real data example to demonstrate the performance of the proposed empirical likelihood-based screening procedure (denoted by EL-SIS) and the corresponding iterative procedure (denoted by EL-ISIS). Depending on the example setting, we compare them with the screening methods proposed in Fan and Lv (2008) (denoted by LS-SIS and LS-ISIS) and Fan and Song (2010) (denoted by GLM-SIS and GLM-ISIS) for linear regression models and generalized linear models, respectively. Whenever appropriate, we also compare with the robust rank correlation based screening (RRC-SIS and RRC-ISIS) studied by Li et al. (2012). For all simulation examples, we begin with p = 1000 explanatory variables and screen down to a much smaller number d of explanatory variables. SCAD penalized variable selection is further applied to the selected explanatory variables to obtain the corresponding final model. Results over 200 repetitions are reported. For each case, we report the number of repetitions in which each important explanatory variable is selected in the final model, as well as the average number of repetitions in which each unimportant explanatory variable is selected.
Example 1
This example has a very standard setting with three important explanatory variables and is taken from Fan and Lv (2008). Covariates are generated as Xj ~ N(0, 1) with cov(Xj, Xj′) = 1 if j = j′ and 0.3 otherwise. The response is generated as Y = 5X1 + 5X2 + 5X3 + ε with the error being independent of the explanatory variables. We consider three different error distributions for ε: N(0, 1), N(0, 2²) and t4. Random samples of size n = 100 are used and we set d = ⌊n/(2 log n)⌋ = 10, where ⌊a⌋ denotes the largest integer less than or equal to a. Results over 200 repetitions are reported in Table 1, where we report the number of repetitions in which each of the important explanatory variables X1, X2 and X3 is selected. For the unimportant explanatory variables, Table 1 reports the average number of repetitions in which each is selected. It shows that the proposed empirical likelihood-based screening methods perform very competitively compared to the least squares-based screening and the robust rank correlation-based screening.
Table 1. Selection results for Example 1 over 200 repetitions: the number of repetitions in which each important variable is selected, and the average number of repetitions in which each unimportant variable is selected.

| ε | Method | X1 | X2 | X3 | Unimportant explanatory variables |
|---|---|---|---|---|---|
| N(0, 1) | LS-SIS | 199 | 199 | 200 | 1.406219 |
| | RRC-SIS | 199 | 199 | 199 | 1.407222 |
| | EL-SIS | 194 | 183 | 185 | 1.442327 |
| | LS-ISIS | 200 | 200 | 200 | 0.965898 |
| | RRC-ISIS | 200 | 200 | 200 | 0.800401 |
| | EL-ISIS | 200 | 200 | 200 | 0.659980 |
| N(0, 2²) | LS-SIS | 199 | 199 | 200 | 1.406219 |
| | RRC-SIS | 198 | 198 | 199 | 1.409228 |
| | EL-SIS | 192 | 182 | 183 | 1.447342 |
| | LS-ISIS | 200 | 200 | 200 | 1.404213 |
| | RRC-ISIS | 200 | 200 | 200 | 1.403210 |
| | EL-ISIS | 200 | 200 | 200 | 0.980943 |
| t4 | LS-SIS | 199 | 199 | 200 | 1.406219 |
| | RRC-SIS | 198 | 199 | 199 | 1.408225 |
| | EL-SIS | 193 | 186 | 187 | 1.438315 |
| | LS-ISIS | 200 | 200 | 200 | 1.383149 |
| | RRC-ISIS | 200 | 200 | 200 | 1.362086 |
| | EL-ISIS | 200 | 199 | 200 | 0.635908 |
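For reference, the design of Example 1 can be replicated with the screening sketch from Section 2 as follows (our own illustration; the equicorrelated covariates are generated through a shared factor):

```python
rng = np.random.default_rng(1)
n, p, rho = 100, 1000, 0.3
Z0 = rng.standard_normal((n, 1))                 # shared factor
X = np.sqrt(rho) * Z0 + np.sqrt(1 - rho) * rng.standard_normal((n, p))
Y = 5 * X[:, 0] + 5 * X[:, 1] + 5 * X[:, 2] + rng.standard_normal(n)
keep, _ = el_sis(X, Y)                           # d = floor(n / (2 log n)) = 10
print(sorted(keep))                              # typically contains 0, 1, 2
```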
Example 2
The second example is also from Fan and Lv (2008) and has a hidden important explanatory variable, which is important but marginally uncorrelated with the response. This example illustrates that the proposed iterative empirical likelihood-based screening works effectively in such challenging cases. Covariates are generated as Xj ~ N(0, 1) with cov(Xj, Xj′) = 1 if j = j′ and 0.3 otherwise, except that $\mathrm{cov}(X_4, X_j) = \sqrt{0.3}$ for all j ≠ 4. The response is generated as $Y = 5X_1 + 5X_2 + 5X_3 - 15\sqrt{0.3}\,X_4 + \varepsilon$, so that X4 is marginally uncorrelated with Y, with ε independent of the explanatory variables. We consider three different error distributions: N(0, 1), N(0, 2²) and t4. Results over 200 repetitions with n = 100 and d = ⌊n/(2 log n)⌋ = 10 are reported in Table 2. It shows that the empirical likelihood-based screening is challenged by the hidden important explanatory variable X4, but the corresponding iterative screening can easily pick it up. Overall, the performance of the empirical likelihood-based screening methods is very similar to that of the least squares-based screening methods and is better than the robust rank correlation-based screening. Note that the iterative version of the robust rank correlation-based screening is residual-based, which explains its improvement.
Table 2. Selection results for Example 2 over 200 repetitions, including the hidden important variable X4.

| ε | Method | X1 | X2 | X3 | X4 (hidden) | Unimportant explanatory variables |
|---|---|---|---|---|---|---|
| N(0, 1) | LS-SIS | 198 | 197 | 195 | 0 | 1.415663 |
| | RRC-SIS | 196 | 197 | 194 | 0 | 1.418675 |
| | EL-SIS | 198 | 198 | 197 | 0 | 1.412651 |
| | LS-ISIS | 200 | 200 | 199 | 196 | 1.125502 |
| | RRC-ISIS | 200 | 199 | 200 | 111 | 1.157631 |
| | EL-ISIS | 199 | 199 | 200 | 193 | 0.853414 |
| N(0, 2²) | LS-SIS | 198 | 197 | 194 | 0 | 1.416667 |
| | RRC-SIS | 196 | 196 | 194 | 0 | 1.419679 |
| | EL-SIS | 198 | 196 | 194 | 0 | 1.417671 |
| | LS-ISIS | 199 | 200 | 199 | 196 | 1.210843 |
| | RRC-ISIS | 199 | 199 | 200 | 96 | 1.311245 |
| | EL-ISIS | 200 | 200 | 198 | 188 | 0.912651 |
| t4 | LS-SIS | 197 | 197 | 197 | 0 | 1.414659 |
| | RRC-SIS | 195 | 198 | 196 | 0 | 1.416667 |
| | EL-SIS | 197 | 198 | 196 | 0 | 1.414659 |
| | LS-ISIS | 199 | 200 | 200 | 196 | 1.209839 |
| | RRC-ISIS | 200 | 200 | 199 | 100 | 1.305221 |
| | EL-ISIS | 200 | 198 | 200 | 185 | 0.824297 |
Example 3
The performances of the empirical likelihood-based screening and the least squares-based screening methods are very similar in the previous two examples. It is known that the empirical likelihood approach requires less restrictive distributional assumptions. We next use a heteroscedastic example to show the advantage of the empirical likelihood-based screening. Explanatory variables are generated as Xj ~ N(0, 1) with cov(Xj, Xj′) = 0 for j ≠ j′. The response is generated from a heteroscedastic model with important variables X1, X2 and X3 and independent ε ~ N(0, 1), where the conditional variance of the error term depends on the covariates and c > 0 controls the signal level. Results over 200 repetitions with n = 70 and d = ⌊n/(2 log n)⌋ = 8 are reported in Table 3 for three different values of c. It shows that the performance of the least squares-based screening is severely affected by the heteroscedasticity, especially when the signal level is low. In contrast, the proposed empirical likelihood-based screening works much better and performs similarly to the robust rank correlation-based screening.
Table 3. Selection results for the heteroscedastic Example 3 over 200 repetitions.

| c | Method | X1 | X2 | X3 | Unimportant explanatory variables |
|---|---|---|---|---|---|
| 1 | LS-SIS | 149 | 147 | 156 | 1.151454 |
| | RRC-SIS | 191 | 185 | 190 | 1.037111 |
| | EL-SIS | 190 | 184 | 191 | 1.038114 |
| 1.5 | LS-SIS | 173 | 171 | 174 | 1.085256 |
| | RRC-SIS | 194 | 191 | 193 | 1.025075 |
| | EL-SIS | 196 | 192 | 194 | 1.021063 |
| 2 | LS-SIS | 182 | 182 | 180 | 1.059178 |
| | RRC-SIS | 194 | 194 | 195 | 1.020060 |
| | EL-SIS | 199 | 195 | 194 | 1.015045 |
Example 4
We now consider an example with the extended scope. In this example, we generate data from the longitudinal data example of Section 4.1, with 4 repeated measurements per subject. In particular, the data are generated from the model $Y_{it} = X_{it}^{T}\beta + \varepsilon_{it}$ (i = 1, …, n; t = 1, …, 4).
Here Xit is generated from the multivariate normal N(0, Σ) with Σ = (σjk)j,k=1,…,p and $\sigma_{jk} = 0.5^{|j-k|}$. The error vector εi = (εi1, …, εi4)^T is generated from a multivariate normal distribution with unit variances. The correlation structure of εi is specified as AR(1) with parameter 0.8; see Diggle et al. (2002) for reference on this correlation structure. The first five components of the true β are set to c · (2.0, −2.0, 0, 0, 2.0)^T, where c is used to control the signal strength, and all other components of β are zero. We use two sets of basis matrices in (4.2). We take M1 = I, the identity matrix; the second basis matrix M2 is a matrix with the two main off-diagonals being 1 and 0 elsewhere, corresponding to the AR(1) working correlation [Qu, Lindsay and Li (2000)]. We then apply the marginal empirical likelihood procedure of Section 4 using the marginal estimating function of (4.2); here we note that the marginal estimating function is 4-dimensional. By ignoring the correlation structure of the longitudinal data, the least squares-based screening and the robust rank correlation-based screening procedures can also be applied. Results over 200 repetitions with n = 60 and d = 15 are reported in Table 4. From Table 4, we clearly see that the marginal empirical likelihood approach works much better than the alternatives, especially when the signal is relatively weak. The improvement can be seen as the result of incorporating additional data structural information. Hence, we demonstrate the advantage of the marginal empirical likelihood approach in being adaptive and flexible.
Table 4. Selection results for the longitudinal Example 4 over 200 repetitions.

| c | Method | X1 | X2 | X5 | Unimportant explanatory variables |
|---|---|---|---|---|---|
| 1 | LS-SIS | 90 | 73 | 153 | 2.692076 |
| | GEE-SIS | 111 | 111 | 168 | 2.617854 |
| | RRC-SIS | 84 | 66 | 136 | 2.722166 |
| | EL-SIS | 135 | 128 | 191 | 2.553661 |
| 1.5 | LS-SIS | 153 | 153 | 195 | 2.506520 |
| | GEE-SIS | 165 | 160 | 196 | 2.486459 |
| | RRC-SIS | 142 | 136 | 193 | 2.536610 |
| | EL-SIS | 176 | 187 | 199 | 2.445336 |
| 2 | LS-SIS | 183 | 183 | 200 | 2.441324 |
| | GEE-SIS | 183 | 184 | 200 | 2.440321 |
| | RRC-SIS | 179 | 176 | 200 | 2.452357 |
| | EL-SIS | 192 | 196 | 200 | 2.419258 |
| 2.5 | LS-SIS | 195 | 195 | 200 | 2.417252 |
| | GEE-SIS | 196 | 195 | 200 | 2.416249 |
| | RRC-SIS | 192 | 190 | 200 | 2.425276 |
| | EL-SIS | 198 | 197 | 200 | 2.412237 |
| 3 | LS-SIS | 199 | 198 | 200 | 2.410231 |
| | GEE-SIS | 198 | 197 | 200 | 2.412237 |
| | RRC-SIS | 199 | 198 | 200 | 2.410231 |
| | EL-SIS | 200 | 198 | 200 | 2.409228 |
In the review process, one referee pointed out that our comparison to LS-SIS is not fair, as it is based on ordinary least squares; it is more reasonable to compare with a weighted least squares-based screening that adjusts for the correlation among longitudinal observations. To address this issue, we implemented this weighted least squares-based screening using the R package "geepack," which can estimate both the correlation structure and the regression parameters once a parametric form of the correlation structure is specified. Table 4 is updated accordingly, with GEE-SIS denoting this weighted least squares-based screening method. It shows that our newly proposed EL-SIS still does better than GEE-SIS even though a correct parametric correlation structure, AR(1), is specified.
Example 5
This is an extension of Example 2 to the case of a binary response using logistic regression. Covariates are generated as in Example 2: Xj ~ N(0, 1) with cov(Xj, Xj′) = 1 if j = j′ and 0.3 otherwise, except that $\mathrm{cov}(X_4, X_j) = \sqrt{0.3}$ for all j ≠ 4. The binary response is generated from a Bernoulli distribution with success probability $\exp(\eta)/\{1 + \exp(\eta)\}$, where $\eta = 5X_1 + 5X_2 + 5X_3 - 15\sqrt{0.3}\,X_4$. Results over 200 repetitions with n = 400 and d = 10 are reported in Table 5. A similar performance pattern is observed. For this example, the result for the iterative version of the robust rank correlation-based screening is not presented since it is not clear how to define a residual-based iterative procedure.
Table 5. Selection results for the logistic regression Example 5 over 200 repetitions.

| Method | X1 | X2 | X3 | X4 (hidden) | Unimportant explanatory variables |
|---|---|---|---|---|---|
| GLM-SIS | 200 | 200 | 200 | 0 | 1.405623 |
| RRC-SIS | 200 | 200 | 200 | 0 | 1.405623 |
| EL-SIS | 200 | 200 | 200 | 0 | 1.405623 |
| GLM-ISIS | 200 | 200 | 200 | 200 | 0.324297 |
| EL-ISIS | 199 | 200 | 200 | 199 | 0.764056 |
A real data example
Glioblastoma is the most common primary malignant brain tumor in adults and one of the most lethal of all cancers [Horvath et al. (2006)]. The median survival of glioblastoma patients is 15 months from the time of diagnosis. We next apply our proposed methods to a microarray gene expression dataset of glioblastoma patients reported in Horvath et al. (2006). The dataset has been analyzed by Pan, Xie and Shen (2010) and Li and Li (2008), among many others. Drawn from two different studies, the data consist of two independent sets; we use the set with 50 samples. We use the log survival time, measured in years, as the response. The second sample, with an outlier response, is excluded, and the other 49 samples are used in our analysis. Explanatory variables are the gene expression profiles of 1523 genes measured on Affymetrix HG-U133A arrays.
We apply the least squares-based and empirical likelihood-based screening methods with d = 6. LS-SIS selects “GSN”, “FOS”, “COL11A1”, “AVPR1A”, “SELE”, and “TBL1X” as important gene explanatory variables while EL-SIS selects “GSN”, “JAK2”, “COL11A1”, “CDK6”, “ADCYAP1R1”, and “TBL1X”. Note that they select some common genes (“GSN” and “COL11A1”) and some different genes. LS-ISIS selects “GSN”, “COL11A1”, “THBS1”, “SELE”, “TBL1X”, and “GCGR”. EL-ISIS selects “DUSP7”, “COL11A1”, “BST1”, “ADCYAP1R1”, “TBL1X”, and “GCGR”. Similarly two genes (“TBL1X” and “GCGR”) are recruited by the iterative screening methods based on both the least squares and empirical likelihood. The robust rank correlation-based screening performs similarly with 2–3 overlapping genes.
6. Discussion
Screening based on marginal model fitting has enjoyed great popularity in the recent literature. However, most, if not all, of the marginal screening methods studied thus far are based on some restrictive distributional assumptions, which may not be realistic in applications. Thus motivated, we propose a new screening method based on the marginal empirical likelihood, which is known to be less restrictive. It has been demonstrated to be effective through both the theoretical sure screening property and numerical evidence. Further extensions using empirical likelihood are being investigated.
Supplementary Material
Supplement to "Marginal empirical likelihood and sure independence feature screening" (DOI: 10.1214/13-AOS1139SUPP; .pdf). This supplement contains all technical proofs.
Acknowledgments
We thank the Editor, the Associate Editor and two referees for very constructive comments and suggestions which have improved the presentation of the paper. We are very grateful to Drs. Gaorong Li, Yi Li, Heng Peng and Sihai Dave Zhao for sharing with us programs for implementing their methods and Drs. Hongzhe Li and Wei Pan for sharing the real data.
References
- Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Heidelberg.
- Chang, J., Chen, S. X. and Chen, X. (2013). High dimensional generalized empirical likelihood for moment restrictions with dependent data. Available at arXiv:1308.5732.
- Chang, J., Tang, C. Y. and Wu, Y. (2013). Supplement to "Marginal empirical likelihood and sure independence feature screening." DOI: 10.1214/13-AOS1139SUPP.
- Chen, S. X. and Cui, H. (2003). An extended empirical likelihood for generalized linear models. Statist. Sinica 13 69–81.
- Chen, S. X., Gao, J. and Tang, C. Y. (2008). A test for model specification of diffusion processes. Ann. Statist. 36 167–198.
- Chen, S. X., Peng, L. and Qin, Y.-L. (2009). Effects of data dimension on empirical likelihood. Biometrika 96 711–722.
- Chen, S. X. and Van Keilegom, I. (2009). A review on empirical likelihood methods for regression. TEST 18 415–447.
- Diggle, P. J., Heagerty, P. J., Liang, K.-Y. and Zeger, S. L. (2002). Analysis of Longitudinal Data, 2nd ed. Oxford Statistical Science Series 25. Oxford Univ. Press, Oxford.
- Fan, J., Feng, Y. and Song, R. (2011). Nonparametric independence screening in sparse ultra-high-dimensional additive models. J. Amer. Statist. Assoc. 106 544–557.
- Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B Stat. Methodol. 70 849–911.
- Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statist. Sinica 20 101–148.
- Fan, J. and Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. Ann. Statist. 38 3567–3604.
- Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica 50 1029–1054.
- Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York.
- Hjort, N. L., McKeague, I. W. and Van Keilegom, I. (2009). Extending the scope of empirical likelihood. Ann. Statist. 37 1079–1111.
- Horvath, S., Zhang, B., Carlson, M., Lu, K. V., Zhu, S., Felciano, R. M., Laurance, M. F., Zhao, W., Shu, Q., Lee, Y., Scheck, A. C., Liau, L. M., Wu, H., Geschwind, D. H., Febbo, P. G., Kornblum, H. I., Cloughesy, T. F., Nelson, S. F. and Mischel, P. S. (2006). Analysis of oncogenic signaling networks in glioblastoma identifies ASPM as a molecular target. Proc. Natl. Acad. Sci. USA 103 17402–17407.
- Huang, J., Horowitz, J. L. and Ma, S. (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann. Statist. 36 587–613.
- Kolaczyk, E. D. (1994). Empirical likelihood for generalized linear models. Statist. Sinica 4 199–218.
- Leng, C. and Tang, C. Y. (2012). Penalized empirical likelihood and growing dimensional general estimating equations. Biometrika 99 703–716.
- Li, C. and Li, H. (2008). Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics 24 1175–1182.
- Li, G., Peng, H., Zhang, J. and Zhu, L. (2012). Robust rank correlation based screening. Ann. Statist. 40 1846–1877.
- Li, R., Zhong, W. and Zhu, L. (2012). Feature screening via distance correlation learning. J. Amer. Statist. Assoc. 107 1129–1139.
- Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73 13–22.
- McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Chapman & Hall/CRC, New York.
- Newey, W. K. and Smith, R. J. (2004). Higher order properties of GMM and generalized empirical likelihood estimators. Econometrica 72 219–255.
- Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75 237–249.
- Owen, A. B. (2001). Empirical Likelihood. Chapman & Hall/CRC, New York.
- Pan, W., Xie, B. and Shen, X. (2010). Incorporating predictor network in penalized regression with application to microarray data. Biometrics 66 474–484.
- Petrov, V. V. (1995). Limit Theorems of Probability Theory: Sequences of Independent Random Variables. Oxford Studies in Probability 4. Oxford Univ. Press, New York.
- Qin, J. and Lawless, J. (1994). Empirical likelihood and general estimating equations. Ann. Statist. 22 300–325.
- Qu, A., Lindsay, B. G. and Li, B. (2000). Improving generalised estimating equations using quadratic inference functions. Biometrika 87 823–836.
- Saulis, L. and Statulevičius, V. A. (1991). Limit Theorems for Large Deviations. Mathematics and Its Applications (Soviet Series) 73. Kluwer Academic, Dordrecht. Translated and revised from the 1989 Russian original.
- Tang, C. Y. and Leng, C. (2010). Penalized high-dimensional empirical likelihood. Biometrika 97 905–919.
- Wang, H. (2012). Factor profiled sure independence screening. Biometrika 99 15–28.
- Xue, L. and Zou, H. (2011). Sure independence screening and compressed random sensing. Biometrika 98 371–380.
- Zhao, S. D. and Li, Y. (2012). Sure screening for estimating equations in ultra-high dimensions. Unpublished manuscript.
- Zhu, L.-P., Li, L., Li, R. and Zhu, L.-X. (2011). Model-free feature screening for ultrahigh-dimensional data. J. Amer. Statist. Assoc. 106 1464–1475.