Author manuscript; available in PMC: 2015 May 16.
Published in final edited form as: J Bus Econ Stat. 2014 May 16;32(2):237–244. doi: 10.1080/07350015.2013.863158

Feature Screening for Ultrahigh Dimensional Categorical Data with Applications

Danyang Huang 1, Runze Li 1, Hansheng Wang 1
PMCID: PMC4197855  NIHMSID: NIHMS590985  PMID: 25328278

Abstract

Ultrahigh dimensional data with both categorical responses and categorical covariates are frequently encountered in the analysis of big data, for which feature screening has become an indispensable statistical tool. We propose a Pearson chi-square based feature screening procedure for a categorical response with ultrahigh dimensional categorical covariates. The proposed procedure can be directly applied to detect important interaction effects. We further show that the proposed procedure possesses the screening consistency property in the terminology of Fan and Lv (2008). We investigate the finite sample performance of the proposed procedure by Monte Carlo simulation studies, and illustrate the proposed method with two empirical datasets.

Keywords: Feature Screening, Pearson’s Chi-Square Test, Screening Consistency, Search Engine Marketing, Text Classification, Ultrahigh Dimensional Data

1. INTRODUCTION

Since the seminal work of Fan and Lv (2008), feature screening for ultrahigh dimensional data has received considerable attention in the recent literature. Wang (2009) proposed a forward regression method for feature screening in ultrahigh dimensional linear models. Fan, Samworth and Wu (2009) and Fan and Song (2010) developed sure independence screening (SIS) procedures for generalized linear models and robust linear models. Fan, Feng, and Song (2011) developed a nonparametric SIS procedure for additive models. Li, Peng, Zhang and Zhu (2012) developed a rank correlation based SIS procedure for linear models. Liu, Li and Wu (2013) developed a SIS procedure for varying coefficient models based on conditional Pearson correlation. The aforementioned procedures are all model-based. In the analysis of ultrahigh dimensional data, however, it is very challenging to specify a correct model at the initial stage. Thus, Zhu et al. (2011) advocated model-free procedures and proposed a sure independent ranking and screening procedure based on multi-index models. Li, Zhong and Zhu (2012) proposed a model-free SIS procedure based on distance correlation (Székely, Rizzo and Bakirov, 2007). He, Wang and Hong (2013) proposed a quantile-adaptive model-free SIS procedure for ultrahigh dimensional heterogeneous data. Mai and Zou (2013) proposed a SIS procedure for binary classification with ultrahigh dimensional covariates based on the Kolmogorov statistic. All of these methods implicitly assume that the predictor variables are continuous. However, ultrahigh dimensional data with categorical predictors and categorical responses are frequently encountered in practice. This work aims to develop a new SIS-type procedure for this particular situation.

This work was partially motivated by an empirical analysis of data related to search engine marketing (SEM), which is also referred to as paid search advertising (PSA). It has become standard practice to place textual advertisements on search engines such as Google in the USA and Baidu in China. Keyword management plays a critical role in textual advertising and is therefore of particular importance in SEM practice. Specifically, in order to maximize the number of potential customers, an SEM practitioner typically maintains a large number of relevant keywords. Depending on the business scale, the total number of keywords ranges from thousands to millions. Managing so many keywords in practice is a challenging task. For ease of management, the keywords need to be classified into fine groups; this is a requirement enforced by all major search engines (e.g., Google and Baidu). Ideally, the keywords belonging to the same group should share similar textual formulation and semantic meaning. This is a nontrivial task demanding tremendous effort and expertise. Current industry practice relies largely on manual effort, which is expensive and inaccurate. This is particularly true in China, which has the largest emerging SEM market in the world. Thus, how to automatically classify Chinese keywords into pre-specified groups becomes a problem of great importance. At its core, this is a problem of constructing high dimensional categorical features and identifying the important ones.

From a statistical point of view, we can formulate the problem as follows. We treat each keyword as a sample and index it by i with 1 ≤ i ≤ n. Next, let Yi ∈ {1, 2, … , K} be the class label. We then convert the textual message contained in each keyword into a high dimensional binary indicator vector. Specifically, we collect a set of the most frequently used Chinese characters and index them by j with 1 ≤ j ≤ p. Define a binary indicator $X_{ij}$ as $X_{ij} = 1$ if the jth Chinese character appears in the ith keyword and $X_{ij} = 0$ otherwise. Collect all these binary indicators in a vector $X_i = (X_{i1}, \ldots, X_{ip})^\top \in \mathbb{R}^p$. Because the total number of Chinese characters is huge, the dimension of $X_i$ (i.e., p) is ultrahigh. Subsequently, the original problem of keyword management becomes an ultrahigh dimensional classification problem from $X_i$ to $Y_i$. Many existing methods, including k-nearest neighbors (Hastie et al., 2001, kNN), random forest (Breiman, 2001, RF), and the support vector machine (Tong and Koller, 2001; Kim et al., 2005, SVM), can be used for high dimensional binary classification. However, these methods become unstable when the problem is ultrahigh dimensional. As a result, feature screening becomes indispensable.

This paper aims to develop a feature screening procedure for multiclass classification with ultrahigh dimensional categorical predictors. To this end, we propose using the Pearson chi-square (PC) test statistic to measure the dependence between the categorical response and each categorical predictor, and we develop a screening procedure based on this statistic. Since the Pearson chi-square test statistic can be directly calculated using most statistical software packages, the proposed procedure can be easily implemented in practice. We further study the theoretical properties of the proposed procedure. We rigorously prove that, with overwhelming probability, the proposed procedure retains all important features, which implies the sure independence screening (SIS) property in the terminology of Fan and Lv (2008). In fact, under certain conditions, the proposed method can consistently identify the true model. For convenience, the proposed procedure is referred to as PC-SIS; it possesses the following virtues.

The PC-SIS is a model-free screening procedure because its implementation does not require one to specify a model for the response and the predictors. This is an appealing property since it is challenging to specify a model at the initial stage of analyzing ultrahigh dimensional data. The PC-SIS can be directly applied to multi-categorical responses and multi-categorical predictors. The PC-SIS has excellent capability in detecting important interaction effects, by creating new categorical predictors for interactions between predictors. Furthermore, the PC-SIS is also applicable to multiple responses and to grouped or multivariate predictors, by defining a new univariate categorical variable for the multiple responses or the grouped predictors. Lastly, by appropriate categorization, the PC-SIS can handle situations with both categorical and continuous predictors. In summary, the PC-SIS provides a unified approach to feature screening in ultrahigh dimensional categorical data analysis. We conduct Monte Carlo simulations to empirically verify our theoretical findings, and illustrate the proposed methodology with two empirical datasets.

The rest of this article is organized as follows. Section 2 describes the PC-SIS procedure in detail and establishes its theoretical properties. Section 3 presents numerical studies. Section 4 presents two real-world applications. Concluding remarks are given in Section 5. Technical proofs are given in the Appendix.

2. THE PEARSON CHI-SQUARE TEST BASED SCREENING PROCEDURE

2.1. Sure Independence Screening

Let $Y_i \in \{1, \ldots, K\}$ be the corresponding class label, and let $X_i = (X_{i1}, \ldots, X_{ip})^\top \in \mathbb{R}^p$ be the associated categorical predictor. Since the predictors involved in our intended SEM application are binary, we assume hereafter that $X_{ij}$ is binary. This allows us to slightly simplify our notation and technical proofs; however, the developed method and theory can be readily applied to general categorical predictors. Define the generic notation $\mathcal{S} = \{j_1, \ldots, j_d\}$ for a model with $X_{ij_1}, \ldots, X_{ij_d}$ included as relevant features. Let $|\mathcal{S}| = d$ be the model size. Let $X_i^{(\mathcal{S})} = (X_{ij} : j \in \mathcal{S}) \in \mathbb{R}^{|\mathcal{S}|}$ be the subvector of $X_i$ corresponding to $\mathcal{S}$. Define $D(Y_i \mid X_i^{(\mathcal{S})})$ to be the conditional distribution of $Y_i$ given $X_i^{(\mathcal{S})}$. Then a candidate model $\mathcal{S}$ is called sufficient if

$$D(Y_i \mid X_i) = D(Y_i \mid X_i^{(\mathcal{S})}). \qquad (2.1)$$

Obviously, the full model $\mathcal{S}_F = \{1, 2, \ldots, p\}$ is sufficient. Thus, we are only interested in the smallest sufficient model. Theoretically, we can consider the intersection of all sufficient models. If the intersection is still sufficient, it must be the smallest. We call it the true model and denote it by $\mathcal{S}_T$. Throughout the rest of this article, we assume that $\mathcal{S}_T$ exists with $|\mathcal{S}_T| = d_0$.

The objective of feature screening is to find a model estimate $\hat{\mathcal{S}}$ such that: (1) $\mathcal{S}_T \subset \hat{\mathcal{S}}$; and (2) the size of $\hat{\mathcal{S}}$ is as small as possible. To this end, we follow the marginal screening idea of Fan and Lv (2008) and propose a Pearson chi-square type statistic as follows. Define $P(Y_i = k) = \pi_{yk}$, $P(X_{ij} = k) = \pi_{jk}$, and $P(Y_i = k_1, X_{ij} = k_2) = \pi_{yj,k_1k_2}$. These quantities can be estimated by $\hat{\pi}_{yk} = n^{-1}\sum_i I(Y_i = k)$, $\hat{\pi}_{jk} = n^{-1}\sum_i I(X_{ij} = k)$, and $\hat{\pi}_{yj,k_1k_2} = n^{-1}\sum_i I(Y_i = k_1)I(X_{ij} = k_2)$. Subsequently, a chi-square type statistic can be defined as

$$\hat{\Delta}_j = \sum_{k_1=1}^{K}\sum_{k_2=1}^{2} \frac{(\hat{\pi}_{yk_1}\hat{\pi}_{jk_2} - \hat{\pi}_{yj,k_1k_2})^2}{\hat{\pi}_{yk_1}\hat{\pi}_{jk_2}}, \qquad (2.2)$$

which is a natural estimator of

$$\Delta_j = \sum_{k_1=1}^{K}\sum_{k_2=1}^{2} \frac{(\pi_{yk_1}\pi_{jk_2} - \pi_{yj,k_1k_2})^2}{\pi_{yk_1}\pi_{jk_2}}. \qquad (2.3)$$

Obviously, those predictors with larger $\hat{\Delta}_j$ values are more likely to be relevant. As a result, we can estimate the true model by $\hat{\mathcal{S}} = \{j : \hat{\Delta}_j > c\}$, where c > 0 is some pre-specified constant. For convenience, we refer to $\hat{\mathcal{S}}$ as the PC-SIS estimator.

Remark 1

As one can see, $\hat{\mathcal{S}}$ can be equivalently defined in terms of p-values. Specifically, define $\hat{P}_j = P(\chi^2_K > n\hat{\Delta}_j)$, where $\chi^2_K$ stands for a chi-squared random variable with K degrees of freedom. Because $\hat{P}_j$ is a monotonically decreasing function of $\hat{\Delta}_j$, $\hat{\mathcal{S}}$ can be equivalently expressed as $\hat{\mathcal{S}} = \{j : \hat{P}_j < p_c\}$ for some constant $0 < p_c < 1$. When the number of categories differs across predictors, a predictor involving more categories is likely to be associated with a larger $\Delta_j$ value, regardless of whether the predictor is important or not. In that case, directly using $\hat{\Delta}_j$ for variable screening is less accurate, and using the p-value $\hat{P}_j$ is more appropriate.
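To make the computation concrete, the following is a minimal sketch (in Python, assuming numpy and scipy are available) of the marginal statistic in (2.2) and its p-value conversion from Remark 1 for binary predictors; the function name and array layout are our own choices, not part of the paper.

```python
import numpy as np
from scipy import stats

def pc_sis_statistics(X, y, num_classes):
    """Marginal Pearson chi-square type statistics (2.2) and their
    p-values (Remark 1) for binary predictors.

    X : (n, p) integer array of binary (0/1) predictors.
    y : (n,) integer array of class labels in {1, ..., K}.
    """
    n, p = X.shape
    delta = np.zeros(p)
    for j in range(p):
        for k1 in range(1, num_classes + 1):
            pi_y = np.mean(y == k1)                            # \hat\pi_{y k1}
            for k2 in (0, 1):
                pi_j = np.mean(X[:, j] == k2)                  # \hat\pi_{j k2}
                pi_yj = np.mean((y == k1) & (X[:, j] == k2))   # \hat\pi_{yj, k1 k2}
                if pi_y * pi_j > 0:                            # skip empty cells
                    delta[j] += (pi_y * pi_j - pi_yj) ** 2 / (pi_y * pi_j)
    # Remark 1: p-values from a chi-square reference with K degrees of freedom
    pvals = stats.chi2.sf(n * delta, df=num_classes)
    return delta, pvals
```

Screening then retains the predictors with the largest statistics or, equivalently, the smallest p-values.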

2.2. Theoretical Properties

We next investigate the theoretical properties of $\hat{\mathcal{S}}$. Define $\omega_{jk_1k_2} = \mathrm{cov}\{I(Y_i = k_1), I(X_{ij} = k_2)\}$. We then assume the following conditions.

(C1) (Response Probability) Assume that there exist two positive constants $0 < \pi_{\min} < \pi_{\max} < 1$ such that $\pi_{\min} < \pi_{yk} < \pi_{\max}$ for every 1 ≤ k ≤ K, and $\pi_{\min} < \pi_{jk} < \pi_{\max}$ for every 1 ≤ j ≤ p and 1 ≤ k ≤ K.

(C2) (Marginal Covariance) Assume $\Delta_j = 0$ for any $j \notin \mathcal{S}_T$. We further assume that there exists a positive constant $\omega_{\min}$ such that $\min_{j \in \mathcal{S}_T}\max_{k_1k_2}(\omega_{jk_1k_2})^2 > \omega_{\min}$.

(C3) (Divergence Rate) Assume $\log p \le \nu n^{\xi}$ for some constants $\nu > 0$ and $0 < \xi < 1$.

Condition (C1) excludes those features for which one particular category's probability is extremely small (i.e., $\pi_{yk} \approx 0$) or extremely large (i.e., $\pi_{yk} \approx 1$). Condition (C2) requires that, for every relevant categorical feature $j \in \mathcal{S}_T$, there exist at least one response category (i.e., $k_1$) and one feature category (i.e., $k_2$) that are marginally correlated (i.e., $(\omega_{jk_1k_2})^2 > \omega_{\min}$). Under a linear regression setup, a similar condition was used by Fan and Lv (2008) in terms of the marginal covariance. Condition (C2) also assumes that $\Delta_j = 0$ for every $j \notin \mathcal{S}_T$. With the help of this condition, we can rigorously show in Theorem 1 that $\hat{\mathcal{S}}$ is selection consistent for $\mathcal{S}_T$, that is, $P(\hat{\mathcal{S}} = \mathcal{S}_T) \to 1$ as n → ∞. If this condition is removed, the conclusion becomes screening consistency (Fan and Lv, 2008), that is, $P(\mathcal{S}_T \subset \hat{\mathcal{S}}) \to 1$ as n → ∞. Lastly, Condition (C3) allows the feature dimension p to diverge at an exponentially fast rate in terms of the sample size n. Accordingly, the feature dimension can be much larger than the sample size n. We then have the following theorem.

Theorem 1. (Strong Screening Consistency) Under Conditions (C1)–(C3), there exists a positive constant c such that $P(\hat{\mathcal{S}} = \mathcal{S}_T) \to 1$.

2.3. Interaction Screening

Interaction detection is important for the intended SEM application. Consider two arbitrary features $X_{ij_1}$ and $X_{ij_2}$. We say they are free of interaction effect if, conditional on $Y_i$, they are independent of each other. Otherwise, we say they have a nontrivial interaction effect. Theoretically, such an interaction effect can be conveniently measured by

$$\Omega_{j_1j_2} = \sum_{k}\sum_{k_1=1}^{2}\sum_{k_2=1}^{2} \frac{(\pi_{k,j_1,k_1}\pi_{k,j_2,k_2} - \pi_{k,j_1j_2,k_1k_2})^2}{\pi_{k,j_1,k_1}\pi_{k,j_2,k_2}},$$

where $\pi_{k,j_1j_2,k_1k_2} = P(X_{ij_1} = k_1, X_{ij_2} = k_2 \mid Y_i = k)$ and $\pi_{k,j,k^*} = P(X_{ij} = k^* \mid Y_i = k)$. They can be estimated, respectively, by $\hat{\pi}_{k,j_1j_2,k_1k_2} = \{\sum_i I(Y_i = k)\}^{-1}\sum_i I(X_{ij_1} = k_1, X_{ij_2} = k_2, Y_i = k)$ and $\hat{\pi}_{k,j,k^*} = \{\sum_i I(Y_i = k)\}^{-1}\sum_i I(X_{ij} = k^*, Y_i = k)$. Subsequently, $\Omega_{j_1j_2}$ can be estimated by

$$\hat{\Omega}_{j_1j_2} = \sum_{k}\sum_{k_1=1}^{2}\sum_{k_2=1}^{2} \frac{(\hat{\pi}_{k,j_1,k_1}\hat{\pi}_{k,j_2,k_2} - \hat{\pi}_{k,j_1j_2,k_1k_2})^2}{\hat{\pi}_{k,j_1,k_1}\hat{\pi}_{k,j_2,k_2}}.$$

Accordingly, those interaction terms with large $\hat{\Omega}_{j_1j_2}$ values should be considered promising. As a result, it is natural to select important interaction effects by $\tilde{\mathcal{I}} = \{(j_1, j_2) : \hat{\Omega}_{j_1j_2} > c\}$ for some critical value c > 0. It is remarkable that the critical value c used here is typically different from that used for $\hat{\mathcal{S}}$. As one can imagine, searching for important interaction effects over every possible feature pair is computationally expensive. To save computational cost, we suggest focusing on those features in $\hat{\mathcal{S}}$. This leads to the following practical solution

$$\hat{\mathcal{I}} = \{(j_1, j_2) : \hat{\Omega}_{j_1j_2} > c \text{ and } j_1, j_2 \in \hat{\mathcal{S}}\}. \qquad (2.4)$$

Under appropriate conditions, we can also show that $P(\hat{\mathcal{I}} = \mathcal{I}_T) \to 1$ as n → ∞, where $\mathcal{I}_T = \{(j_1, j_2) : \Omega_{j_1j_2} > 0\}$.
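The sketch below illustrates the restricted pairwise search in (2.4) for binary predictors, again in Python with hypothetical function names; it simply evaluates the conditional-independence statistic for every pair of features within the screened main-effect set.

```python
import numpy as np
from itertools import combinations

def interaction_statistic(x1, x2, y, num_classes):
    """Estimate Omega_{j1,j2}: a chi-square type measure of dependence
    between two binary features, conditional on the class label."""
    omega = 0.0
    for k in range(1, num_classes + 1):
        mask = (y == k)
        if mask.sum() == 0:
            continue
        for k1 in (0, 1):
            p1 = np.mean(x1[mask] == k1)                      # \hat\pi_{k, j1, k1}
            for k2 in (0, 1):
                p2 = np.mean(x2[mask] == k2)                  # \hat\pi_{k, j2, k2}
                p12 = np.mean((x1[mask] == k1) & (x2[mask] == k2))
                if p1 * p2 > 0:                               # skip empty cells
                    omega += (p1 * p2 - p12) ** 2 / (p1 * p2)
    return omega

def screen_interactions(X, y, selected, num_classes, threshold):
    """Evaluate (2.4): scan only the pairs within the screened set."""
    scores = {}
    for j1, j2 in combinations(sorted(selected), 2):
        scores[(j1, j2)] = interaction_statistic(X[:, j1], X[:, j2], y, num_classes)
    return {pair: v for pair, v in scores.items() if v > threshold}
```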

2.4. Tuning Parameter Selection

We first consider tuning parameter selection for $\hat{\mathcal{S}}$. To this end, various non-negative values can be considered for c. This leads to a set of candidate models, which are collected into a solution path $\mathbb{F} = \{\hat{\mathcal{S}}_j : 1 \le j \le p\}$, where $\hat{\mathcal{S}}_j = \{k_1, \ldots, k_j\}$. Here $\{k_1, \ldots, k_p\}$ is a permutation of $\{1, \ldots, p\}$ such that $\hat{\Delta}_{k_1} \ge \hat{\Delta}_{k_2} \ge \cdots \ge \hat{\Delta}_{k_p}$. As a result, the original problem of tuning parameter selection for c is converted into a problem of model selection over $\mathbb{F}$. To solve this problem, we propose the following maximum ratio criterion. To illustrate the idea, we temporarily assume that $\mathcal{S}_T \in \mathbb{F}$. Recall that the true model size is $|\mathcal{S}_T| = d_0$. We should then have $\hat{\Delta}_{k_j}/\hat{\Delta}_{k_{j+1}} \to_p c_{j,j+1}$ for some positive constant $c_{j,j+1} > 0$, as long as $j + 1 \le d_0$. On the other hand, if $j > d_0$, both $\hat{\Delta}_{k_j}$ and $\hat{\Delta}_{k_{j+1}}$ converge in probability towards 0. If their convergence rates are comparable, we should have $\hat{\Delta}_{k_j}/\hat{\Delta}_{k_{j+1}} = O_p(1)$. However, if $j = d_0$, we should have $\hat{\Delta}_{k_j} \to_p c_j$ for some positive constant $c_j > 0$ but $\hat{\Delta}_{k_{j+1}} \to_p 0$. This makes the ratio $\hat{\Delta}_{k_j}/\hat{\Delta}_{k_{j+1}} \to_p \infty$, which suggests that $d_0$ can be estimated by

$$\hat{d} = \arg\max_{0 \le j \le p-1} \frac{\hat{\Delta}_{k_j}}{\hat{\Delta}_{k_{j+1}}},$$

where $\hat{\Delta}_{k_0}$ is defined to be $\hat{\Delta}_{k_0} = 1$ for the sake of completeness. Accordingly, the final model estimate is given by $\hat{\mathcal{S}} = \{k_1, k_2, \ldots, k_{\hat{d}}\} \in \mathbb{F}$. A similar idea can also be used to estimate the interaction model $\hat{\mathcal{I}}$ and to obtain the interaction model size $\hat{d}_I$. Our numerical experiments suggest that this criterion works fairly well.
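A minimal sketch of the maximum ratio criterion follows, assuming the marginal statistics have already been computed (for example by a function such as pc_sis_statistics above); the small numerical floor is our own guard against division by zero and is not part of the paper.

```python
import numpy as np

def maximum_ratio_selection(delta):
    """Estimate the model size d_0 by the maximum ratio criterion and
    return it together with the indices of the selected features.

    delta : (p,) array of marginal statistics \hat\Delta_j.
    """
    order = np.argsort(delta)[::-1]                        # permutation k_1, ..., k_p
    sorted_delta = np.concatenate(([1.0], delta[order]))   # prepend \hat\Delta_{k_0} = 1
    sorted_delta = np.maximum(sorted_delta, 1e-12)         # guard against zero statistics
    ratios = sorted_delta[:-1] / sorted_delta[1:]          # \hat\Delta_{k_j} / \hat\Delta_{k_{j+1}}, j = 0, ..., p-1
    d_hat = int(np.argmax(ratios))                         # maximizing index is the size estimate
    return d_hat, order[:d_hat]
```

The selected main-effect model is then {k_1, …, k_d̂}; the same routine can be applied to the interaction statistics to choose the interaction model size.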

3. SIMULATION STUDIES

3.1. Example 1: a Model without Interaction

We first consider a simple example without any interaction effect. We generate $Y_i \in \{1, 2, \ldots, K\}$ with K = 4 and $P(Y_i = k) = 1/K$ for every 1 ≤ k ≤ K. Define the true model to be $\mathcal{S}_T = \{1, 2, \ldots, 10\}$ with $|\mathcal{S}_T| = 10$. Next, conditional on $Y_i$, we generate the relevant features as $P(X_{ij} = 1 \mid Y_i = k) = \theta_{kj}$ for every 1 ≤ k ≤ K and $j \in \mathcal{S}_T$; their detailed values are given in Table 1. Then, for any 1 ≤ k ≤ K and $j \notin \mathcal{S}_T$, we define $\theta_{kj} = 0.5$. For a comprehensive evaluation, various feature dimensions (p = 1000, 5000) and sample sizes (n = 200, 500, 1000) are considered.

Table 1.

Probability Specification for Example 1

θ_kj    j=1   j=2   j=3   j=4   j=5   j=6   j=7   j=8   j=9   j=10
k=1     0.2   0.8   0.7   0.2   0.2   0.9   0.1   0.1   0.7   0.7
k=2     0.9   0.3   0.3   0.7   0.8   0.4   0.7   0.6   0.4   0.1
k=3     0.7   0.2   0.1   0.6   0.7   0.6   0.8   0.9   0.1   0.8
k=4     0.1   0.9   0.6   0.1   0.3   0.1   0.4   0.3   0.6   0.4

For each random replication, the proposed maximum ratio method is used to select both $\hat{\mathcal{S}}$ and $\hat{\mathcal{I}}$. Subsequently, the number of correctly identified main effects, $\mathrm{CME} = |\hat{\mathcal{S}} \cap \mathcal{S}_T|$, and the number of incorrectly identified main effects, $\mathrm{IME} = |\hat{\mathcal{S}} \cap \mathcal{S}_T^c|$ with $\mathcal{S}_T^c = \mathcal{S}_F \setminus \mathcal{S}_T$, are computed. The interaction effects are summarized similarly, which leads to the numbers of correctly and incorrectly identified interaction effects, denoted by CIE and IIE, respectively. Moreover, the final model size, $\mathrm{MS} = |\hat{\mathcal{S}}| + |\hat{\mathcal{I}}|$, is computed, and the coverage percentage, defined by $\mathrm{CP} = (|\hat{\mathcal{S}} \cap \mathcal{S}_T| + |\hat{\mathcal{I}} \cap \mathcal{I}_T|)/(|\mathcal{S}_T| + |\mathcal{I}_T|)$, is recorded. Lastly, all these summary measures are averaged across the 200 simulation replications and reported in Table 2; they correspond to the rows with screening method flagged by $\hat{\mathcal{S}} + \hat{\mathcal{I}}$. For comparison, the full main effect model $\mathcal{S}_F$ (i.e., the model with all main effects and no interactions) and the selected main effect model $\hat{\mathcal{S}}$ (i.e., the model with the main effects in $\hat{\mathcal{S}}$ and no interactions) are also included.

Table 2.

Example 1 Detailed Simulation Results

Main Effect Interaction Effect
p n Method CME IME CIE IIE MS CP%
1000 200 SF 10.0 990.0 0.0 0.0 1000.0 100.0
S^ 9.8 0.0 0.0 0.0 9.9 98.6
S^+I^ 9.8 0.0 0.0 1.1 11.0 98.6
500 SF 10.0 990.0 0.0 0.0 1000.0 100.0
S^ 10.0 0.0 0.0 0.0 10.0 100.0
S^+I^ 10.0 0.0 0.0 0.2 10.2 100.0
1000 SF 10.0 990.0 0.0 0.0 1000.0 100.0
S^ 10.0 0.0 0.0 0.0 10.0 100.0
S^+I^ 10.0 0.0 0.0 0.0 10.0 100.0

5000 200 SF 10.0 4990.0 0.0 0.0 5000.0 100.0
S^ 9.6 0.0 0.0 0.0 9.6 96.6
S^+I^ 9.6 0.0 0.0 1.1 10.7 96.6
500 SF 10.0 4990.0 0.0 0.0 5000.0 100.0
S^ 10.0 0.0 0.0 0.0 10.0 100.0
S^+I^ 10.0 0.0 0.0 0.8 10.8 100.0
1000 SF 10.0 4990.0 0.0 0.0 5000.0 100.0
S^ 10.0 0.0 0.0 0.0 10.0 100.0
S^+I^ 10.0 0.0 0.0 0.0 10.0 100.0

The detailed results are given in Table 2. For a given simulation model, a fixed feature dimension p, and a diverging sample size n, we find that CME increases towards $|\mathcal{S}_T| = 10$, IME decreases towards 0, and there is no over-fitting effect. This corroborates the theoretical result of Theorem 1 very well. Meanwhile, since there is no interaction in this particular model, CIE remains 0 and IIE converges towards 0 as n goes to infinity.

3.2. Example 2: a Model with Interaction

We next investigate an example with genuine interaction effects. Specifically, the class label is generated in the same way as the previous example with K = 4. Conditional on Yi = k, we generate Xij with j ∈ {1, 3, 5, 7} according to probability P(Xij = 1∣Yi = k) = θkj, whose detailed values are given in Table 3. Conditional on Yi and Xi,2m−1, we generate Xi,2m according to

$$P(X_{i,2m} = 1 \mid Y_i = k, X_{i,2m-1} = 0) = 0.05\, I(\theta_{k,2m-1} \ge 0.5) + 0.4\, I(\theta_{k,2m-1} < 0.5),$$
$$P(X_{i,2m} = 1 \mid Y_i = k, X_{i,2m-1} = 1) = 0.95\, I(\theta_{k,2m-1} \ge 0.5) + 0.4\, I(\theta_{k,2m-1} < 0.5),$$

for every 1 ≤ k ≤ K and m ∈ {1, 2, 3, 4}. Lastly, we define $\theta_{kj} = 0.4$ for any 1 ≤ k ≤ K and j > 8. Accordingly, we should have $\mathcal{S}_T = \{1, 2, \ldots, 8\}$ and $\mathcal{I}_T = \{(1,2), (3,4), (5,6), (7,8)\}$. The detailed results are given in Table 4. The basic findings are qualitatively similar to those in Table 2. The only difference is that the CIE value no longer converges towards 0; instead, it converges towards $|\mathcal{I}_T| = 4$ as n → ∞ with p fixed. Also, the CP values for $\hat{\mathcal{S}}$ are no longer near 100%, since $\hat{\mathcal{S}}$ only takes main effects into consideration. Instead, the CP value for $\hat{\mathcal{S}} + \hat{\mathcal{I}}$ converges towards 100% as n increases with p fixed.
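As a concrete illustration of the data-generating mechanism above, the following sketch simulates one replication of Example 2 (here with p = 1000 and n = 500); numpy is assumed, the seed and variable names are our own choices, and the θ values are taken from Table 3.

```python
import numpy as np

rng = np.random.default_rng(2014)               # seed is our own choice
K, p, n = 4, 1000, 500

# theta[k-1, j-1] = P(X_ij = 1 | Y_i = k); 0.4 everywhere except the
# odd-indexed relevant features j = 1, 3, 5, 7 (values from Table 3)
theta = np.full((K, p), 0.4)
theta[:, [0, 2, 4, 6]] = np.array([[0.8, 0.8, 0.7, 0.9],
                                   [0.1, 0.3, 0.2, 0.3],
                                   [0.7, 0.9, 0.1, 0.1],
                                   [0.2, 0.1, 0.9, 0.7]])

y = rng.integers(1, K + 1, size=n)              # uniform class labels in {1, ..., K}
X = (rng.random((n, p)) < theta[y - 1]).astype(int)

# Regenerate the even-indexed relevant features so that X_{i,2m} depends on both
# Y_i and X_{i,2m-1}, producing the interaction pairs (1,2), (3,4), (5,6), (7,8).
for m in range(1, 5):
    odd, even = 2 * m - 2, 2 * m - 1            # 0-based columns of X_{i,2m-1}, X_{i,2m}
    strong = theta[y - 1, odd] >= 0.5           # I(theta_{k,2m-1} >= 0.5) per sample
    prob = np.where(strong, np.where(X[:, odd] == 1, 0.95, 0.05), 0.4)
    X[:, even] = (rng.random(n) < prob).astype(int)
```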

Table 3.

Probability Specification for Example 2

θ_kj    j=1   j=3   j=5   j=7
k=1     0.8   0.8   0.7   0.9
k=2     0.1   0.3   0.2   0.3
k=3     0.7   0.9   0.1   0.1
k=4     0.2   0.1   0.9   0.7

Table 4.

Example 2 Detailed Simulation Results

Main Effect Interaction Effect
p n Method CME IME CIE IIE MS CP%
1000 200 SF 8.0 992.0 0.0 0.0 1000.0 66.6
S^ 5.4 0.0 0.0 0.0 5.4 45.7
S^+I^ 5.4 0.0 1.4 5.0 12.0 58.2
500 SF 8.0 992.0 0.0 0.0 1000.0 66.6
S^ 7.8 0.0 0.0 0.0 7.8 65.5
S^+I^ 7.8 0.0 3.8 1.1 12.8 97.8
1000 SF 8.0 992.0 0.0 0.0 1000.0 66.6
S^ 8.0 0.0 0.0 0.0 8.0 66.6
S^+I^ 8.0 0.0 4.0 0.2 12.2 100.0

5000 200 SF 8.0 4992.0 0.0 0.0 5000.0 66.6
S^ 4.9 0.0 0.0 0.0 4.9 41.2
S^+I^ 4.9 0.0 0.9 4.0 9.9 49.5
500 SF 8.0 4992.0 0.0 0.0 5000.0 66.6
S^ 7.5 0.0 0.0 0.0 7.5 63.1
S^+I^ 7.5 0.0 3.5 1.7 12.8 92.9
1000 SF 8.0 4992.0 0.0 0.0 5000.0 66.6
S^ 7.9 0.0 0.0 0.0 7.9 66.6
S^+I^ 7.9 0.0 3.9 0.2 12.2 99.9

3.3. Example 3: a Model with both Categorical and Continuous Variables

We consider here an example with both categorical and continuous variables. Fix $|\mathcal{S}_T| = d_0 = 20$. Here, $Y_i \in \{1, 2\}$ is generated according to $P(Y_i = 1) = P(Y_i = 2) = 1/2$. Given $Y_i = k$, we generate a latent variable $Z_i = (Z_{i1}, Z_{i2}, \ldots, Z_{ip})^\top \in \mathbb{R}^p$ with $Z_{ij}$ independently distributed as $N(\mu_{kj}, 1)$, where $\mu_{kj} = 0$ for any $j > d_0$, $\mu_{kj} = -0.5$ if $Y_i = 1$ and $j \le d_0$, and $\mu_{kj} = 0.5$ if $Y_i = 2$ and $j \le d_0$. Lastly, we construct the observed feature $X_{ij}$ as follows. If j is an odd number, we define $X_{ij} = Z_{ij}$; otherwise, we define $X_{ij} = I(Z_{ij} > 0)$. As a result, this example involves a total of $d_0 = 20$ relevant features, half of them continuous and half categorical. To apply our method, we first discretize the continuous variables into categorical ones. Specifically, let $z_\alpha$ stand for the αth quantile of the standard normal distribution. We then re-define each continuous predictor as $X_{ij} = 1$ if $X_{ij} < z_{0.25}$, $X_{ij} = 2$ if $z_{0.25} < X_{ij} < z_{0.50}$, $X_{ij} = 3$ if $z_{0.50} < X_{ij} < z_{0.75}$, and $X_{ij} = 4$ if $X_{ij} > z_{0.75}$. By doing so, all features become categorical. We next apply our method to the converted datasets using p-values, as described in Remark 1. The experiment is replicated in the same manner as before, with detailed results summarized in Table 5. The results are qualitatively similar to those in Example 1.
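The quartile-based discretization step can be written compactly as below; this is a minimal sketch assuming numpy and scipy, with a hypothetical function name.

```python
import numpy as np
from scipy.stats import norm

def discretize_by_normal_quartiles(x):
    """Map a continuous predictor to categories {1, 2, 3, 4} using the
    quartiles of the standard normal distribution, as in Example 3."""
    cuts = norm.ppf([0.25, 0.50, 0.75])   # z_{0.25}, z_{0.50}, z_{0.75}
    return np.digitize(x, cuts) + 1        # below z_{0.25} -> 1, ..., above z_{0.75} -> 4
```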

Table 5.

Example 3 Detailed Simulation Results

Main Effect Interaction Effect
p n Method CME IME CIE IIE MS CP%
1000 200 SF 20.0 980.0 0.0 0.0 1000.0 100.0
S^ 17.9 0.2 0.0 0.0 18.2 89.6
S^+I^ 17.9 0.2 0.0 0.3 18.5 89.6
1000 500 SF 20.0 980.0 0.0 0.0 1000.0 100.0
S^ 19.9 0.0 0.0 0.0 19.9 99.9
S^+I^ 19.9 0.0 0.0 0.0 19.9 99.9
1000 1000 SF 20.0 980.0 0.0 0.0 1000.0 100.0
S^ 20.0 0.0 0.0 0.0 20.0 100.0
S^+I^ 20.0 0.0 0.0 0.0 20.0 100.0

5000 200 SF 20.0 4980.0 0.0 0.0 5000.0 100.0
S^ 15.7 0.2 0.0 0.0 16.0 78.9
S^+I^ 15.7 0.2 0.0 1.0 17.1 79.1
5000 500 SF 20.0 4980.0 0.0 0.0 5000.0 100.0
S^ 19.9 0.0 0.0 0.0 19.9 99.9
S^+I^ 19.9 0.0 0.0 0.0 19.9 99.9
5000 1000 SF 20.0 4980.0 0.0 0.0 5000.0 100.0
S^ 20.0 0.0 0.0 0.0 20.0 100.0
S^+I^ 20.0 0.0 0.0 0.0 20.0 100.0

4. REAL DATA ANALYSIS

4.1. A Chinese Keyword Dataset

The dataset contains a total of 639 keywords (i.e., samples), which are classified into K = 13 categories. The total number of Chinese characters involved is p = 341. For each class, we randomly split the sample into two parts of equal size; one part is used for training and the other for testing. The sample size of the training data is n = 320. Based on the training data, models are selected by the proposed PC-SIS method, and various classification methods (i.e., kNN, SVM, and RF) are then applied. Their forecasting accuracies are examined on the testing data. For a reliable evaluation, this experiment is randomly replicated 200 times. The detailed results are given in Table 6. As seen, the PC-SIS estimated main effect model $\hat{\mathcal{S}}$, with size 14.6 on average, consistently outperforms the full model $\mathcal{S}_F$, regardless of the classification method. The improvement margin can be as large as 87.2% − 51.1% = 36.1% for SVM. This outstanding performance can be further improved by including about 22.3 interaction effects on average; the maximum additional improvement is 78.0% − 67.6% = 10.4%, achieved by RF.

Table 6.

Detailed Results for Search Engine Marketing Dataset

Method   Model Size   Main Effects   Interaction Effects   kNN Accuracy%   SVM Accuracy%   RF Accuracy%
SF 341.00 341.00 0.00 76.89 51.13 60.57
S^ 14.60 14.60 0.00 85.20 87.19 67.55
S^+I^ 36.85 14.60 22.25 86.96 88.66 78.01

4.2. Labor Supply Dataset

We next consider a dataset about labor supply. This is an important dataset generously shared by Mroz (1987) and discussed by Wooldridge (2002). It contains a total of 753 married white women aged between 30 and 60 in the year 1975. For illustration purposes, we take a binary variable $Y_i \in \{0, 1\}$ as the response of interest, which indicates whether or not the woman participated in the labor market. The dataset contains a total of 77 predictive variables with interaction terms included. These variables were observed for both participating and non-participating women and are recorded in $X_i$. Understanding the regression relationship between $X_i$ and $Y_i$ is useful for calculating the propensity score for a woman's employment decision (Rosenbaum and Rubin, 1983). However, due to its high dimensionality, directly using all the predictors for propensity score estimation is suboptimal. Thus, we are motivated to apply our method for variable screening.

Following a similar strategy, we randomly split the dataset into two parts of equal size; one part is used for training and the other for testing. We then apply the PC-SIS method to the training dataset. Because this dataset involves both continuous and categorical predictors, the discretization method (as given in simulation Example 3) is used. Applying PC-SIS to the discretized dataset leads to the estimated model $\hat{\mathcal{S}}$. Because interaction terms with good economic meaning are already included in $X_i$ (Mroz, 1987), we did not further pursue the interaction model $\hat{\mathcal{I}}$. A standard logistic regression model is then estimated on the training dataset, and the resulting model's forecasting accuracy is evaluated on the testing data in terms of AUC, the area under the ROC curve (Wang, 2007). The definition is as follows. Let $\hat{\beta}$ be the maximum likelihood estimator, obtained by fitting a logistic regression model of $Y_i$ on $X_i$ using the training data. Denote the testing dataset by $\mathcal{T}$, which can be further decomposed as $\mathcal{T} = \mathcal{T}_0 \cup \mathcal{T}_1$ with $\mathcal{T}_0 = \{i \in \mathcal{T} : Y_i = 0\}$ and $\mathcal{T}_1 = \{i \in \mathcal{T} : Y_i = 1\}$. Simply speaking, $\mathcal{T}_0$ and $\mathcal{T}_1$ collect the indices of those testing samples with response 0 and 1, respectively. Then, the AUC of Wang (2007) is defined as

$$\mathrm{AUC} = \frac{1}{n_0 n_1} \sum_{i_1 \in \mathcal{T}_1} \sum_{i_2 \in \mathcal{T}_0} I\big(X_{i_1}^\top\hat{\beta} > X_{i_2}^\top\hat{\beta}\big), \qquad (4.1)$$

where $n_0$ and $n_1$ are the sample sizes of $\mathcal{T}_0$ and $\mathcal{T}_1$, respectively.
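A minimal sketch of the rank-based AUC in (4.1) is given below, assuming numpy arrays of test-set linear scores $X_i^\top\hat{\beta}$ and binary labels; the function name is hypothetical.

```python
import numpy as np

def auc_rank(scores, labels):
    """AUC as in (4.1): the fraction of (positive, negative) test-set pairs
    whose fitted scores X_i' beta_hat are correctly ordered."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # compare every positive score against every negative score
    return np.mean(pos[:, None] > neg[None, :])
```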

For comparison purposes, the full model $\mathcal{S}_F$ is also evaluated. For a reliable evaluation, the experiment is randomly replicated 200 times. We find that a total of 10.20 features are selected on average with AUC = 98.03%, which is very comparable to that of the full model (AUC = 98.00%) but with substantially fewer features. Lastly, we apply our method to the whole dataset; 10 important main effects are identified and no interaction is included. The 10 selected main effects are, respectively, family income, after-tax full income, wife's weeks worked last year, wife's usual hours of work per week last year, actual wife experience, salary, hourly wage, overtime wage, hourly wage from the previous year, and a variable indicating whether the hourly wage from the previous year is nonzero.

Remark 2. One can also evaluate AUC according to (4.1) but based on the whole sample and then optimize it with respect to an arbitrary regression coefficient β. This leads to the Maximum Rank Correlation (MRC) estimator, which has been well studied by Han (1987), Sherman (1993), and Baker (2003).

5. CONCLUDING REMARKS

To conclude this article, we discuss here two interesting topics for future study. First, as we discussed before, the proposed method and theory can be readily extended to the situation with general categorical predictors. Second, we assume here the number of response classes (i.e., K) is finite. How to conduct variable selection and screening with a diverging K is theoretically challenging.


Acknowledgments

This research was supported by National Institute on Drug Abuse (NIDA) grant P50-DA10075. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or NIDA. The authors are grateful to the Editor, the AE, and the anonymous referees for their constructive comments, which led to a much improved manuscript. The authors are also grateful to Prof. Mroz for sharing this interesting labor supply dataset with us.

APPENDIX. Proof of Theorem 1

The proof of Theorem 1 consists of three steps. First, we show that there exists a lower bound on $\Delta_j$ for every $j \in \mathcal{S}_T$. Second, we establish that $\hat{\Delta}_j$ is a uniformly consistent estimator of $\Delta_j$ over 1 ≤ j ≤ p. Last, we argue that there exists a positive constant c such that $\hat{\mathcal{S}} = \mathcal{S}_T$ with probability tending to 1.

Step 1. By definition, we have $\omega_{jk_1k_2} = \pi_{yj,k_1k_2} - \pi_{yk_1}\pi_{jk_2}$. For every $j \in \mathcal{S}_T$, Condition (C1) implies that $\pi_{yk}$ and $\pi_{jk}$ are both bounded above by $\pi_{\max}$. We then have $\Delta_j = \sum_{k_1}\sum_{k_2}\{(\omega_{jk_1k_2})^2(\pi_{yk_1}\pi_{jk_2})^{-1}\} \ge \pi_{\max}^{-2}\sum_{k_1}\sum_{k_2}(\omega_{jk_1k_2})^2$. Next, by Condition (C2), if $j \in \mathcal{S}_T$, then $\sum_{k_1}\sum_{k_2}(\omega_{jk_1k_2})^2 \ge \max_{k_1k_2}(\omega_{jk_1k_2})^2 > \omega_{\min}$. These results together make $\Delta_j$ bounded below by $\omega_{\min}\pi_{\max}^{-2}$. We can then define $\Delta_{\min} = 0.5\,\omega_{\min}\pi_{\max}^{-2}$, which is a positive constant satisfying $\min_{j\in\mathcal{S}_T}\Delta_j > \Delta_{\min}$.

Step 2. The proofs of uniform consistency for $\hat{\pi}_{jk}$ and $\hat{\pi}_{yj,k_1k_2}$ are similar. As a result, we omit the details for $\hat{\pi}_{yj,k_1k_2}$. Moreover, given the uniform consistency of $\hat{\pi}_{jk}$ and $\hat{\pi}_{yj,k_1k_2}$, the uniform consistency of $\hat{\Delta}_j$ follows from a standard argument using Taylor's expansion; those technical details are also omitted. We therefore focus on $\hat{\pi}_{jk}$ only.

To this end, we define $Z_{ij,k} = I(X_{ij} = k) - \pi_{jk}$. By definition, $E Z_{ij,k} = 0$, $E Z_{ij,k}^2 = \pi_{jk} - \pi_{jk}^2$, and $|Z_{ij,k}| \le M$ with M = 1. Also, for a fixed pair (j, k), the $Z_{ij,k}$ are independent across i. All these conditions remind us of Bernstein's inequality, by which we have

$$P\left(\left|\sum_i Z_{ij,k}\right| > \epsilon\right) \le 2\exp\left\{-\frac{3\epsilon^2}{2M\epsilon + 6n(\pi_{jk} - \pi_{jk}^2)}\right\},$$

where ε > 0 is an arbitrary positive constant. Since M = 1 and $\pi_{jk} - \pi_{jk}^2 \le 1/4$, the right-hand side of the inequality can be further bounded above by $2\exp\{-6\epsilon^2/(4\epsilon + 3n)\}$. Thus,

$$P\left(\frac{1}{n}\left|\sum_i Z_{ij,k}\right| > \epsilon\right) \le 2\exp\left\{-\frac{6n^2\epsilon^2}{4n\epsilon + 3n}\right\} = 2\exp\left\{-\frac{6n\epsilon^2}{4\epsilon + 3}\right\}.$$

With $\hat{\pi}_{jk} - \pi_{jk} = n^{-1}\sum_i Z_{ij,k}$, we have

$$P\left(\max_k \max_{1\le j\le p}\left|\hat{\pi}_{jk} - \pi_{jk}\right| > \epsilon\right) = P\left(\max_k \max_{1\le j\le p}\frac{1}{n}\left|\sum_i Z_{ij,k}\right| > \epsilon\right) \le \sum_j \sum_k P\left(\frac{1}{n}\left|\sum_i Z_{ij,k}\right| > \epsilon\right) \le 2K\exp\left\{\log p - \frac{6n\epsilon^2}{4\epsilon + 3}\right\} \to 0, \qquad (A.1)$$

where the first inequality is due to Bonferroni's inequality. By Condition (C3), the right-hand side of the final inequality goes to 0 as n → ∞. Then we have, under Conditions (C1)–(C3), $\max_k \max_{1\le j\le p}|\hat{\pi}_{jk} - \pi_{jk}| = o_p(1)$.

Step 3. Recall that $\Delta_{\min} = 0.5\,\omega_{\min}\pi_{\max}^{-2}$. Define $c = (2/3)\Delta_{\min}$; we then should have $\mathcal{S}_T \subset \hat{\mathcal{S}}$. Otherwise, there must exist some $j^* \in \mathcal{S}_T$ with $j^* \notin \hat{\mathcal{S}}$. Accordingly, we must have $\hat{\Delta}_{j^*} \le (2/3)\Delta_{\min}$ while $\Delta_{j^*} > \Delta_{\min}$. Thus $|\hat{\Delta}_{j^*} - \Delta_{j^*}| > (1/3)\Delta_{\min}$, which implies that if $\mathcal{S}_T \not\subset \hat{\mathcal{S}}$, then $\max_j|\hat{\Delta}_j - \Delta_j| > (1/3)\Delta_{\min}$. On the other hand, by $\hat{\Delta}_j$'s uniform consistency with $\epsilon = (1/3)\Delta_{\min}$, we have $P(\mathcal{S}_T \not\subset \hat{\mathcal{S}}) \le P(\max_j|\hat{\Delta}_j - \Delta_j| > (1/3)\Delta_{\min}) \to 0$ as n → ∞.

Similarly, we have $\hat{\mathcal{S}} \subset \mathcal{S}_T$. Otherwise, there must exist some $j^* \in \hat{\mathcal{S}}$ with $j^* \notin \mathcal{S}_T$. Thus $\hat{\Delta}_{j^*} > (2/3)\Delta_{\min}$ while $\Delta_{j^*} = 0$, so that $|\hat{\Delta}_{j^*} - \Delta_{j^*}| > (2/3)\Delta_{\min}$. Letting $\epsilon = (2/3)\Delta_{\min}$ and applying the uniform consistency again, we have $P(\hat{\mathcal{S}} \not\subset \mathcal{S}_T) \le P(\max_j|\hat{\Delta}_j - \Delta_j| > (2/3)\Delta_{\min}) \to 0$ as n → ∞. As a result, we know that $P(\hat{\mathcal{S}} = \mathcal{S}_T) \to 1$ with $c = (2/3)\Delta_{\min}$ as n → ∞. This completes the proof.

REFERENCES

  1. Baker SG. The central role of receiver operating characteristic (ROC) curves in evaluating tests for the early detection of cancer. Journal of the National Cancer Institute. 2003;95:511–515. doi: 10.1093/jnci/95.7.511.
  2. Breiman L. Random forests. Machine Learning. 2001;45:5–32.
  3. Fan J, Feng Y, Song R. Nonparametric independence screening in sparse ultra-high dimensional additive models. Journal of the American Statistical Association. 2011;106:544–557. doi: 10.1198/jasa.2011.tm09779.
  4. Fan J, Lv J. Sure independence screening for ultra-high dimensional feature space (with discussion). Journal of the Royal Statistical Society, Series B. 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x.
  5. Fan J, Samworth R, Wu Y. Ultrahigh dimensional feature selection: beyond the linear model. Journal of Machine Learning Research. 2009;10:1829–1853.
  6. Fan J, Song R. Sure independence screening in generalized linear models with NP-dimensionality. The Annals of Statistics. 2010;38:3567–3604.
  7. Han AK. Nonparametric analysis of a generalized regression model. Journal of Econometrics. 1987;35:303–316.
  8. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. New York: Springer; 2001.
  9. He X, Wang L, Hong HG. Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Annals of Statistics. 2013;41:342–369.
  10. Kim H, Howland P, Park H. Dimension reduction in text classification with support vector machines. Journal of Machine Learning Research. 2005;6:37–53.
  11. Li GR, Peng H, Zhang J, Zhu LX. Robust rank correlation based screening. Annals of Statistics. 2012;40:1846–1877.
  12. Li R, Zhong W, Zhu L. Feature screening via distance correlation learning. Journal of the American Statistical Association. 2012;107:1129–1139. doi: 10.1080/01621459.2012.695654.
  13. Liu J, Li R, Wu R. Feature selection for varying coefficient models with ultrahigh dimensional covariates. Technical Report #2013-01, Department of Statistics, The Pennsylvania State University; 2013.
  14. Mai Q, Zou H. The Kolmogorov filter for variable screening in high-dimensional binary classification. Biometrika. 2013;100:229–234.
  15. Mroz T. The sensitivity of an empirical model of married women's hours of work to economic and statistical assumptions. Econometrica. 1987;55:765–799.
  16. Rosenbaum P, Rubin D. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41–55.
  17. Sherman RP. The limiting distribution of the maximum rank correlation estimator. Econometrica. 1993;61:123–137.
  18. Székely GJ, Rizzo ML, Bakirov NK. Measuring and testing dependence by correlation of distances. Annals of Statistics. 2007;35:2769–2794.
  19. Tong S, Koller D. Support vector machine active learning with application to text classification. Journal of Machine Learning Research. 2001:45–66.
  20. Wang H. A note on iterative marginal optimization: a simple algorithm for maximum rank correlation estimation. Computational Statistics & Data Analysis. 2007;51:2803–2812.
  21. Wang H. Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association. 2009;104:1512–1524.
  22. Wooldridge JM. Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press; 2002.
  23. Zhu LP, Li L, Li R, Zhu LX. Model-free feature screening for ultrahigh dimensional data. Journal of the American Statistical Association. 2011;106:1464–1475. doi: 10.1198/jasa.2011.tm10563.
