Abstract
Generalized varying coefficient models are particularly useful for examining dynamic effects of covariates on a continuous, binary or count response. This paper is concerned with feature screening for generalized varying coefficient models with ultrahigh dimensional covariates. The proposed screening procedure is based on the joint quasi-likelihood of all predictors, and is therefore distinguished from the marginal screening procedures proposed in the literature. In particular, the proposed procedure can effectively identify active predictors that are jointly dependent but marginally independent of the response. To carry out the proposed procedure, we develop an effective algorithm and establish its ascent property. We further prove that the proposed procedure possesses the sure screening property: with probability tending to one, the selected variable set includes the actual active predictors. We examine the finite sample performance of the proposed procedure, compare it with existing ones via Monte Carlo simulations, and illustrate it with a real data example.
Keywords: Generalized varying-coefficient models, ultrahigh dimensional data, variable screening
1. Introduction
Generalized linear models have been well studied in the literature. Variable selection via penalized likelihood has been developed for generalized linear models with large dimensional covariates (Tibshirani, 1996; Fan and Li, 2001). Ultrahigh dimensional data have been collected in various research areas such as genome-wide association studies, proteomics studies, finance, tumor classification and biomedical imaging. Variable selection methods based on penalized likelihood may not perform well for ultrahigh dimensional data because of challenges in algorithmic stability, computational cost and statistical accuracy (Fan, et al., 2009). Fan and Lv (2008) advocated a two-stage approach: (a) reduce the ultrahigh dimensional covariates to a large but manageable dimension by filtering out a large number of irrelevant covariates with a marginal screening procedure, and (b) apply variable selection methods to the reduced model with large dimensional covariates. Fan and Lv (2008) proposed a sure independence screening (SIS) procedure for linear models using the Pearson correlation coefficient as the marginal utility and further established the sure screening property of their procedure under the Gaussian linear model framework. Hall and Miller (2009) proposed a feature screening procedure for transformation linear models by using generalized correlation, and Li, et al. (2012) advocated using rank correlation for screening to deal with heavy-tailed distributions and the presence of outliers. Fan, et al. (2009) proposed an SIS procedure for generalized linear models based on marginal likelihood estimates. More details about these marginal feature screening procedures can be found in the recent review paper on feature screening by Liu, et al. (2015).
Varying coefficient models (VCM) were proposed to deal with the “curse of dimensionality” (Cleveland, et al., 1992; Hastie and Tibshirani, 1993). As a natural extension of linear regression models that allows coefficients to vary with a variable such as age or time, the VCM is particularly useful for exploring dynamic patterns of effects and has been used in various research fields (see, e.g., Zhu, et al., 2011; Tan, et al., 2012; Liu, et al., 2014). Feature screening procedures for VCM with ultrahigh dimensional covariates (referred to as ultrahigh dimensional VCM for short) have been proposed in the literature. Liu, et al. (2014) developed an SIS procedure for ultrahigh dimensional VCM by taking conditional Pearson correlation coefficients as the marginal utility for ranking the importance of predictors. Fan, et al. (2014) proposed an SIS procedure for ultrahigh dimensional VCM by extending the B-spline techniques of Fan, et al. (2011) for additive models. Xia, et al. (2016) further extended the SIS procedure of Fan, et al. (2014) to generalized varying coefficient models (GVCM). Cheng, et al. (2016) proposed a forward variable selection procedure for ultrahigh dimensional VCM based on techniques related to B-spline regression and grouped variable selection. Song, et al. (2014) extended the proposal of Fan, et al. (2014) to longitudinal data without taking into account within-subject correlation, while Chu, et al. (2016) proposed an SIS procedure for longitudinal data based on a weighted residual sum of squares that uses within-subject correlation to improve the accuracy of feature screening. Although feature screening for ultrahigh dimensional VCM is an active research topic, there is little work on joint feature screening for ultrahigh dimensional GVCM, which are particularly useful for examining dynamic effects of covariates on a binary, count or continuous response. For example, Li and Zhang (2011) proposed a new semiparametric threshold model for censored longitudinal data analysis, and Cheng, et al. (2014) offered an automatic procedure for finding a sparse semivarying coefficient model, which is widely used for longitudinal data analysis. This paper intends to fill this gap.
In this paper, we propose a new feature screening procedure for ultrahigh dimensional GVCM. The proposed procedure is based on the joint likelihood of potential active predictors and therefore is distinguished from the existing SIS procedures (Fan, et al., 2014; Liu, et al., 2014; Xia, et al., 2016) in that it is not a marginal screening procedure. Wang (2009) proposed a forward regression approach to feature screening in ultrahigh dimensional linear models. Cheng, et al. (2016) further extended the forward regression procedure to ultrahigh dimensional VCM based on techniques related to B-spline regression and grouped variable selection. Xu and Chen (2014) proposed a feature screening procedure for generalized linear models via the sparsity-restricted maximum likelihood estimator. As demonstrated in Wang (2009), Xu and Chen (2014) and Cheng, et al. (2016), their approaches can perform better than sure independence screening procedures and can effectively identify predictors that are jointly dependent but marginally independent of the response. In this paper, we develop a new screening procedure for the ultrahigh dimensional GVCM based on the joint likelihood of potential active predictors. The proposed procedure can effectively identify active predictors that are jointly dependent but marginally independent of the response without performing an iterative procedure. We develop a computationally effective algorithm to carry out the proposed procedure and establish the ascent property of the proposed algorithm. We further prove that the proposed procedure possesses the sure screening property. That is, with probability tending to one, the selected variable set includes the actual active predictors. In summary, this work makes the following major contributions to the literature. (a) We propose a sure joint screening (SJS) procedure for ultrahigh dimensional GVCM. We further propose an effective algorithm to carry out the proposed screening procedure, and demonstrate the ascent property of the proposed algorithm. (b) We establish the sure screening property for the proposed joint screening procedure.
The rest of this paper is organized as follows. In Section 2, we propose a new feature screening procedure for the ultrahigh dimensional GVCM and develop an effective algorithm to carry it out. We further study theoretical properties of the proposed procedure and algorithm. In Section 3, we present numerical comparisons and an empirical analysis of a real data example. Discussion and concluding remarks are given in Section 4. Technical proofs are given in the Appendix.
2. New feature screening procedure for generalized varying coefficient models
Let Y be the response variable and {x, U} its associated covariates, where x = (X1, · · · , Xp)T is a p-dimensional covariate vector and U is a univariate covariate. Further, let µ(x, U) = E(Y |x, U). The GVCM assumes that
g{µ(x, U)} = xTα(U) = α1(U)X1 + · · · + αp(U)Xp,    (2.1)
where g(·) is a known link function and α(·) = (α1(·), · · · , αp(·))T is a vector of unspecified smooth regression coefficient functions. Here it is assumed that all αj(·)’s are nonparametric functions and that the support of U is bounded and denoted by [a, b].
Suppose that {Ui, xi, Yi}, i = 1, … , n, constitute an independent and identically distributed sample and that, conditionally on {Ui, xi}, the conditional quasi-likelihood of Yi is Q{µ(Ui, xi), Yi}, where the quasi-likelihood function is defined by ∂Q(µ, y)/∂µ = (y − µ)/V(µ), or equivalently Q(µ, y) = ∫_y^µ (y − s)/V(s) ds, for a specific variance function V(s). Denote by ℓ{α(·)} the quasi-likelihood (McCullagh and Nelder, 1989) of the collected data {(Ui, xi, Yi), i = 1, … , n}. That is,
ℓ{α(·)} = Σ_{i=1}^n Q[g⁻¹{xiTα(Ui)}, Yi].    (2.2)
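For a concrete instance of the quasi-likelihood, consider a count response with variance function V(s) = s. Then Q(µ, y) = ∫_y^µ (y − s)/s ds = y log(µ/y) − (µ − y), which equals the Poisson log-likelihood up to a term not involving µ; with the canonical link g(µ) = log µ, model (2.1) becomes a Poisson varying coefficient model of the type used in Example 3.3 below.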
To estimate the nonparametric regression coefficients, we use the B-spline regression method. Let Sn be the space of polynomial splines of a given degree on [a, b], and let ψj(u) = (ψj1(u), · · · , ψjdnj(u))T denote a normalized B-spline basis of Sn with ‖ψjk‖∞ ≤ 1, where ‖ · ‖∞ is the sup norm. For any αnj(u) ∈ Sn, we have
αnj(u) = Σ_{k=1}^{dnj} βjk ψjk(u) = ψj(u)Tβj    (2.3)
for some coefficients βj = (βj1, · · · , βjdnj)T. Here dnj increases with n. We allow dnj to be different for different j since different coefficient functions may have different smoothness. Under some conditions, each nonparametric coefficient function αj(U), j = 1, · · · , p, can be well approximated by functions in Sn.
Substituting (2.3) into (2.2), the maximum quasi-likelihood estimate is obtained by maximizing
ℓ(β) = Σ_{i=1}^n Q[g⁻¹(ziTβ), Yi]    (2.4)
with respect to β, where zi = (Xi1ψ1(Ui)T , · · · , Xipψp(Ui)T)T and β = (β1T, · · · , βpT)T. With a slight abuse of notation, we use ℓ{α(·)} in (2.2) and ℓ(β) in (2.4); the meaning will be clear from the context. In the presence of an ultrahigh dimensional covariate x, the corresponding optimization problem becomes ill-posed. It is typical to assume sparsity; that is, only a few x-covariates are significant, and the others have no impact on the response. We next propose a feature screening procedure for model (2.1).
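To make the construction of zi concrete, the following R sketch builds the B-spline-expanded design matrix Z whose i-th row is ziT, using a cubic B-spline basis from the splines package; using a common number of basis functions dn for all j is a simplification made for illustration only.

```r
# Sketch: build the B-spline-expanded design matrix Z with rows z_i^T.
library(splines)
dn  <- 6                                  # basis functions per coefficient (assumed common)
Psi <- bs(U, df = dn, degree = 3)         # n x dn cubic B-spline basis evaluated at U_i
# z_i stacks X_{ij} * psi(U_i) over j = 1, ..., p, so Z is n x (p * dn)
Z <- do.call(cbind, lapply(seq_len(ncol(x)), function(j) x[, j] * Psi))
```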
2.1. A new feature screening procedure
Denote by ‖αj‖2 = {∫_a^b αj(u)² du}^{1/2} the L2-norm of αj(U). For ease of presentation, s denotes an arbitrary subset of {1, · · · , p}, and xs = {Xj, j ∈ s} denotes the corresponding subvector of x. For a set s, τ(s) stands for the cardinality of s. Suppose the effect of x is sparse, let α*(U) denote the true value of α(U), and let β* denote the corresponding true value of β. Denote s* = {j : ‖αj*‖2 ≠ 0}. By sparsity, we mean that τ(s*) is much less than p. The goal of feature screening is to identify a subset s such that s* ⊆ s with overwhelming probability and τ(s) is also much less than p. Theoretically we may formulate this problem as the following optimization problem:
max over α(·) of ℓ{α(·)} subject to Σ_{j=1}^p I(‖αj‖2 ≠ 0) ≤ m,    (2.5)
for a pre-specified m, which is presumed to be much less than p.
When the approximation error is negligible, we construct a feature screening procedure by considering the following maximization problem:
max over β of ℓ(β) subject to Σ_{j=1}^p I(‖ψj(·)Tβj‖2 ≠ 0) ≤ m.    (2.6)
Note that ‖ψj(·)Tβj‖2² = βjT{∫_a^b ψj(u)ψj(u)T du}βj. Under the assumption that ∫_a^b ψj(u)ψj(u)T du is finite and positive definite for all j = 1, · · · , p, the maximization problem in (2.6) is equivalent to
max over β of ℓ(β) subject to Σ_{j=1}^p I(‖βj‖ ≠ 0) ≤ m.    (2.7)
For high dimensional problems, it becomes almost impossible to solve the constrained maximization problem (2.7) directly. Alternatively, we consider a proxy of the quasi-likelihood function. It follows from the Taylor expansion of the quasi-likelihood function ℓ(γ) at a point β lying within a neighborhood of γ that
ℓ(γ) ≈ ℓ(β) + (γ − β)Tℓ′(β) + (1/2)(γ − β)Tℓ″(β)(γ − β),
where ℓ′(β) and ℓ″(β) are the gradient vector and the Hessian matrix of ℓ(·) evaluated at β. If ℓ″(β) is invertible, the computational complexity of calculating its inverse is of cubic order in the dimension of β. For large p, small n problems, ℓ″(β) becomes not invertible. Low computational cost is always desirable for feature screening. To cope with the singularity of the Hessian matrix and to save computational cost, we propose using the following approximation for ℓ″(γ):
ℓ″(γ) ≈ −uW(β),    (2.8)
where u is a scaling constant to be specified and W(β) = diag{W1(β), · · · , Wp(β)} is a block diagonal matrix with Wj(β) being a dnj × dnj matrix. Here we allow W(β) to depend on β. This implies that we approximate ℓ″(β) by −uW(β).
Combining the Taylor expansion with this approximation, define the surrogate function h(γ|β) = ℓ(β) + (γ − β)Tℓ′(β) − (u/2)(γ − β)TW(β)(γ − β). It can be seen that h(β|β) = ℓ(β) and, under some conditions, h(γ|β) ≤ ℓ(γ) for all γ. This ensures the ascent property; see Theorem 1 below for more details. Since W(β) is a block diagonal matrix, h(γ|β) is an additive function of the γj for any given β. The additivity enables us to obtain a closed form solution for the following maximization problem
max over γ of h(γ|β) subject to Σ_{j=1}^p I(‖γj‖ ≠ 0) ≤ m    (2.9)
for given β and m. Define γ̃j = βj + u⁻¹Wj(β)⁻¹ℓ′j(β) for j = 1, · · · , p, where ℓ′j(β) is the subvector of ℓ′(β) corresponding to βj; then γ̃ = (γ̃1T, · · · , γ̃pT)T is the unconstrained maximizer of h(γ|β). Denote gj = γ̃jTWj(β)γ̃j for j = 1, · · · , p, and sort the gj so that g(1) ≥ g(2) ≥ · · · ≥ g(p). The solution of the maximization problem (2.9) is the hard-thresholding rule defined below: γ̂j = γ̃j if gj ≥ g(m), and γ̂j = 0 otherwise.
This enables us to effectively screen features by using the following algorithm.
Step 1. Set the initial value β(0).
Step 2. For t = 0, 1, 2, · · ·, iteratively conduct Steps 2a and 2b below until the algorithm converges.
Step 2a. Calculate γ̃j(t) = βj(t) + ut⁻¹Wj(β(t))⁻¹ℓ′j(β(t)) and gj(t) = {γ̃j(t)}TWj(β(t))γ̃j(t), j = 1, · · · , p. Let g(1)(t) ≥ · · · ≥ g(p)(t) be the order statistics of the gj(t). Set St = {j : gj(t) ≥ g(m)(t)}, the nonzero index set.
Step 2b. Update β by β(t+1) as follows. If ℓ(γ̂(t)) > ℓ(β(t)), set β(t+1) = γ̂(t), where γ̂(t) is the hard-thresholded vector with γ̂j(t) = γ̃j(t)I(j ∈ St); otherwise, set β(t+1) to be the maximum quasi-likelihood estimate of the submodel St.
Remark: Unlike screening procedures based on marginal (quasi-)likelihood methods, our proposed procedure iteratively updates β in Step 2. This enables the proposed screening procedure to incorporate correlation information among the predictors through updating γ̃(t) and the gj(t). Thus, the proposed procedure is expected to perform better than marginal screening procedures when some predictors are marginally independent of the response but jointly dependent. Meanwhile, each iteration in Step 2 avoids large-scale matrix inversion and therefore can be carried out at low computational cost.
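To make the iteration concrete, the following R sketch implements one plausible version of Steps 1–2 for the identity-link (least-squares) case, taking W(β) as the diagonal of ZTZ and u as the largest eigenvalue of the sample correlation matrix of the columns of Z, as suggested after Theorem 1 below; the function name sjs_screen and the simplified refit in Step 2b are our own choices for illustration, not the paper's exact specification.

```r
# Minimal sketch of the SJS iteration, identity link (assumptions noted above).
# Z: n x (p*dn) B-spline-expanded design; y: response; dn: basis functions per
# coefficient; m: number of groups to retain.
sjs_screen <- function(Z, y, dn, m, max_iter = 50, tol = 1e-6) {
  p <- ncol(Z) / dn
  group_id <- rep(seq_len(p), each = dn)
  u <- max(eigen(cor(Z), symmetric = TRUE, only.values = TRUE)$values)
  w <- colSums(Z^2)                             # diagonal of Z'Z, used as W(beta)
  loglik <- function(b) -0.5 * sum((y - Z %*% b)^2)
  beta <- rep(0, ncol(Z))
  for (t in seq_len(max_iter)) {
    score <- drop(crossprod(Z, y - Z %*% beta)) # l'(beta)
    gamma <- beta + score / (u * w)             # unconstrained maximizer of h(.|beta)
    g <- tapply(w * gamma^2, group_id, sum)     # group importance g_j
    keep <- sort(order(g, decreasing = TRUE)[seq_len(m)])
    active <- which(group_id %in% keep)
    beta_new <- rep(0, ncol(Z))
    beta_new[active] <- qr.solve(Z[, active], y)  # refit selected submodel (Step 2b)
    if (abs(loglik(beta_new) - loglik(beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  list(selected = keep, beta = beta)
}
```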
Theorem 1. Let {β(t)} be the sequence defined in Step 2b in the above algorithm. Denote
Here and hereafter λmax(A) and λmin(A) stand for the maximal and the minimal eigenvalues of a matrix A, respectively. If ut ≥ ρ(t), then
ℓ(β(t+1)) ≥ ℓ(β(t)),
where β(t+1) is defined in Step 2b of the above algorithm.
Theorem 1 establishes the ascent property of the proposed algorithm when ut is appropriately chosen. That is, the proposed algorithm may improve the current estimate within the feasible region (i.e., models with at most m selected predictors), and the resulting estimate in the current step may serve as a refinement of that from the last step. This theorem also provides some insight into how to choose ut in practical implementation. For varying coefficient models with the identity link, Yi = ziTβ + εi, we may set W(β) = diag(ZTZ). In this case ℓ(β) in (2.4) is −(1/2)Σ_{i=1}^n (Yi − ziTβ)², so that −ℓ″(β) = ZTZ, where Z is the n × Σj dnj matrix with i-th row ziT. Thus,
ρ(t) = λmax{diag(ZTZ)−1/2(ZTZ)diag(ZTZ)−1/2} ≡ ρ,
which does not depend on the step of iteration t. If the zi's are marginally standardized so that their marginal sample means and sample standard deviations equal 0 and 1, respectively, then diag(ZTZ)−1/2(ZTZ)diag(ZTZ)−1/2 is the corresponding sample correlation matrix of the zi's. Thus, ρ is the largest eigenvalue of the sample correlation matrix.
2.2. Sure screening property
For a subset s of {1, … , p} with size τ(s), recall the notation xs = {Xj, j ∈ s} and the associated coefficients αs(U) = {αj(U), j ∈ s} corresponding to βs = {βj, j ∈ s}. We denote the true model by s* = {j : ‖αj*‖2 ≠ 0} with q = τ(s*). The objective of feature selection is to obtain a subset ŝ such that s* ⊆ ŝ with very high probability.
We now provide some theoretical justification for the screening procedure for the GVCM. The sure screening property (Fan and Lv, 2008) is referred to as
P(s* ⊆ ŝ) → 1, as n → ∞.    (2.10)
To establish this sure screening property for the proposed feature screening method, we introduce some additional notation. For any model s, let ℓ′s(βs) and ℓ″s(βs) be the score function and the Hessian matrix of ℓ(·) as a function of βs, respectively. Assume that a screening procedure retains m out of p features such that τ(s*) = q < m. We then define
𝒪 = {s : s* ⊂ s, τ(s) ≤ m} and 𝒰 = {s : s* ⊄ s, τ(s) ≤ m}    (2.11)
as the collections of the over-fitted models and the under-fitted models, respectively. We investigate the asymptotic properties of ŝ under the scenario where p, q, m and β* are allowed to depend on the sample size n. We impose the following conditions, some of which are purely technical and serve only to facilitate theoretical understanding of the proposed feature screening procedure.
(C1) The support of U is bounded and is assumed to be [a, b].
(C2) The coefficient functions αj(·) belong to a class of functions ℱ whose rth derivative exists and is Lipschitz of order η; that is, |f(r)(s) − f(r)(t)| ≤ K|s − t|^η for any s, t ∈ [a, b] and f ∈ ℱ,
for some positive constant K, where r is a nonnegative integer and η ∈ (0, 1] such that υ = r + η > 0.5.
(C3) There exist w1, w2 > 0 and non-negative constants τ1, τ2 with τ1 + τ2 < 1/2 such that τ(s*) ≤ w2 n^{τ2} and min over j ∈ s* of ‖αj*‖2 ≥ w1 n^{−τ1}.
(C4) log p = O(nκ) for some 0 ≤ κ < 1 − 2(τ1 + τ2).
(C5) µ′(·)/V (·) is bounded by some constant M > 0.
(C6) There exist constants C1, C2 > 0, δ > 0, such that for sufficiently large n,
for and , where λmin[ ] and λmax[ ] denote the smallest and largest eigenvalues of a matrix.
Under Conditions (C1) and (C2), the following two properties of B-splines are valid.
(a) (de Boor, 1978) For any αnj ∈ Sn written as αnj(u) = ψj(u)Tβj, the normalized basis functions satisfy ψjk(u) ≥ 0 and Σ_k ψjk(u) = 1 for all u ∈ [a, b]. In addition, there exist positive constants C3 and C4 such that C3 dnj⁻¹‖βj‖² ≤ ∫_a^b {ψj(u)Tβj}² du ≤ C4 dnj⁻¹‖βj‖².
(b) (Stone, 1982, 1985) If {αj, j = 1, 2, · · · , p} is a set of functions in the class ℱ described in Condition (C2), there exists a positive constant C5, not depending on the αj(U), such that the uniform approximation error satisfies sup over u ∈ [a, b] of |αj(u) − αnj(u)| ≤ C5 dnj^{−υ}, j = 1, · · · , p, as n → ∞.
Conditions (C1) and (C2) ensure properties (a) and (b), which are required for the B-spline approximation and establishing the sure screening properties.
Note that, based on properties (a) and (b) and Condition (C3), we can derive that
Condition (C3) states a few requirements for establishing the sure screening property of the proposed procedure. The first is the sparsity of β*, which makes sure screening possible with τ(s*) much smaller than n. Condition (C3) also requires that the signal of the active components does not vanish; this is referred to as the minimal signal condition in the literature. The minimal signal condition is a commonly imposed assumption in existing work on marginal feature screening for other models (e.g., Liu, et al., 2014). By (2.12), it is equivalent to requiring that the minimal component of β* does not degenerate too fast, so that the signal is detectable in the asymptotic sequence. Condition (C4) allows p to diverge with n up to an exponential rate. Meanwhile, together with (C6), it confines an appropriate order of m that guarantees the identifiability of s* over s for τ(s) ≤ m. For the varying coefficient model discussed in Section 2.1, Condition (C6) requires
where Zs is the corresponding design matrix of model s. We establish the sure screening property of the quasi-likelihood estimation in the following theorem. In Fan and Song (2010), their Condition D ensures that the tail of the response variable Y is exponentially light (see Lemma 1 of Fan and Song, 2010). Since Condition D corresponds to our Condition (C6), Condition (C6) likewise ensures that the tail of Y is well behaved.
Remark: Our proposed screening procedure is based on the joint quasi-likelihood of all predictors, whereas Fan, Ma and Dai (2014) investigated marginal nonparametric screening methods for sparse ultrahigh dimensional varying coefficient models. Conditions (v) and (vi) in Fan, Ma and Dai (2014) are requirements on the tail distributions of the covariates and of the noise that are used to establish their sure screening property; there, the errors need to be independent but not necessarily normally distributed. In contrast, our corresponding Condition (C6) only assumes that the minimum and maximum eigenvalues of the Hessian matrix are bounded.
Theorem 2. Suppose we have n independent observations with p candidate features from model (2.1) and that Conditions (C1)–(C6) are satisfied. Let ŝ be the set of features of size m obtained by (2.5). Then, we have
P(s* ⊆ ŝ) → 1, as n → ∞.
The proof is given in the Appendix. The sure screening property is an appealing property of a screening procedure since it ensures that the true active predictors are retained in the model selected by the screening procedure. We establish the sure screening property under conditions weaker than those imposed in Fan, et al. (2014) and Xia, et al. (2016).
One has to specify the value of m in practical implementation. As to the choice of m, there are two scenarios. The first chooses m by the data-driven method described in Section 2.3. The second is an ad hoc choice. In the feature screening literature, it is typical to set m = [n/ log(n)] for a parametric model, where [a] denotes the integer part of a (Fan and Lv, 2008). Since we use a linear combination of dn B-spline basis functions in our proposed screening procedure for the GVCM, we set m = [(n/dn)/ log(n/dn)] throughout Examples 3.1, 3.2 and 3.3. Although it is an ad hoc choice, it works reasonably well in our numerical examples. With this choice of m, one is ready to apply existing methods such as the penalized quasi-likelihood method to further remove inactive predictors. To distinguish it from SIS procedures, the proposed procedure is referred to as the sure joint screening (SJS) procedure.
2.3. Choice of m
Feature screening may be used in various contexts. In some contexts, one may treat m as a pre-specified value. For example, due to budget constraints, a biologist may be able to examine up to m genes that potentially associate with a certain phenotype. In other contexts, one may treat m as a tuning parameter that controls model complexity. In such cases, it is desirable to develop an automatic data-driven method to determine m. We propose to select m by minimizing the high-dimensional BIC (HBIC) score:
where Cn is a sequence of numbers that diverges to ∞. Wang, et al. (2013) proposed the HBIC for selecting the tuning parameter in penalized least squares for high dimensional linear models. Here we modify their proposal for high dimensional generalized varying coefficient models. In our simulations, we take Cn = log log n, and compare its performance with AIC and BIC tuning parameter selectors defined in the same manner. It is worth noting that the proposed HBIC tuning parameter selector only requires searching over m = 1, 2, · · · , [n/dn]. This is distinct from the classical AIC and BIC for best subset selection, which require searching over all subsets. Thus, the proposed tuning parameter selector does not incur expensive computational cost.
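Since the exact form of the score is not reproduced above, the sketch below uses one plausible HBIC-type score, −2 × (maximized quasi-log-likelihood) + τ(s)·dn·Cn·log p with Cn = log log n, evaluated for the identity-link case; the score's exact form and the reuse of sjs_screen from the earlier sketch are assumptions made for illustration.

```r
# Hedged sketch: choose m by minimizing an assumed HBIC-type score.
select_m_hbic <- function(Z, y, dn, m_max) {
  n <- length(y); p <- ncol(Z) / dn; Cn <- log(log(n))
  hbic <- sapply(seq_len(m_max), function(m) {
    keep <- sjs_screen(Z, y, dn, m)$selected          # screened model of size m
    cols <- as.vector(outer(seq_len(dn), (keep - 1) * dn, "+"))
    rss  <- sum(lm.fit(Z[, cols, drop = FALSE], y)$residuals^2)
    n * log(rss / n) + m * dn * Cn * log(p)           # -2*loglik (up to constants) + penalty
  })
  which.min(hbic)
}
```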
Recall the collections of over-fitted and under-fitted models defined in (2.11). Theorem 3 below shows that the HBIC selects the correct model size with probability tending to one.
Theorem 3. Suppose we have n independent observations with p candidate features from model (2.1) and that Conditions (C3)–(C6) are satisfied. Let ŝ be the set of features of size m obtained by (2.4) and (2.7). Then, we have
| (2.13) |
where q = τ (s*), and
| (2.14) |
In Example 3.4, we will examine the performance of the proposed HBIC tuning parameter selector.
3. Numerical studies
In this section, we conduct numerical studies to examine the finite sample performance of the proposed procedures and compare them with existing ones. All simulations are conducted in R. Examples 3.1, 3.2 and 3.3 examine the performance of the proposed screening procedures. Following the feature screening literature (e.g., Fan and Lv, 2008), we set m = [n/ log(n)] in these examples. Example 3.4 examines the performance of the proposed HBIC, where m is determined by minimizing the HBIC score.
3.1. Simulation studies
In our simulation, the covariates U and x are generated as follows: first draw (U*, x)T from a (p+1)-dimensional normal distribution with mean zero and covariance matrix Σ, then set U = Φ(U*), where Φ(·) is the cumulative distribution function of N(0, 1). Thus, U follows a uniform distribution U(0, 1) and is correlated with x, and all the predictors X1, …, Xp are correlated with each other. In our simulation, we consider the following two scenarios for Σ = (σij); an R sketch of this data-generating step is given after the list.
Σ1: Compound symmetric correlation structure: σij = 1 if i = j and ρ otherwise.
Σ2: AR(1) correlation structure: σij = ρ^{|i−j|}.
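The following sketch illustrates the data-generating step for the AR(1) scenario; the values of n, p and ρ are placeholders taken from the range of settings used in the examples.

```r
# Sketch: generate (U, x) as described above, with Sigma the AR(1) correlation matrix.
library(MASS)                                        # for mvrnorm
n <- 200; p <- 1000; rho <- 1/2
Sigma <- rho^abs(outer(1:(p + 1), 1:(p + 1), "-"))   # (p+1) x (p+1) AR(1) correlation
Xall  <- mvrnorm(n, mu = rep(0, p + 1), Sigma = Sigma)
U <- pnorm(Xall[, 1])                                # U = Phi(U*), uniform on (0, 1)
x <- Xall[, -1]                                      # p correlated predictors
```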
In our numerical studies, we set the number of B-spline basis functions to be dn for each coefficient function. We use the following two criteria to assess the performance of the proposed procedure; a small sketch for computing them follows the list.
Pa: The proportion of submodels with size d that contain all the true predictors among 1000 simulations.
Pj: The proportion of submodels with size d that contain Xj among 1000 simulations.
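As a small illustration of these criteria, the following sketch computes Pa and Pj from a collection of screening results; the object names selected (a list of selected index vectors, one per replication) and true_set are hypothetical.

```r
# Sketch: compute P_a and P_j over simulation replications (hypothetical objects).
true_set <- 1:4
Pa <- mean(sapply(selected, function(s) all(true_set %in% s)))
Pj <- sapply(true_set, function(j) mean(sapply(selected, function(s) j %in% s)))
```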
Example 3.1. This example is designed to compare the proposed screening procedure with existing SIS procedures for VCM. Since the proposal of Xia, et al. (2016) under the VCM setting coincides with that of Fan, et al. (2014), which shares the same spirit as that of Liu, et al. (2014), and since Song, et al. (2014) and Chu, et al. (2016) were proposed for longitudinal data, we concentrate on the comparison with the CC-SIS procedure proposed by Liu, et al. (2014). Given {U, x}, we generate a continuous response from
Y = α1(U)X1 + α2(U)X2 + α3(U)X3 + α4(U)X4 + ε,    (3.1)
where ε ∼ N(0, 1). Model (3.1) implies that αj(·) = 0 for j > 4 and s* = {1, 2, 3, 4}. We consider two sets of coefficient functions:
α1: Let , and .
α2: , , , .
In this example, we consider p = 1000 and 2000, and the sample size n = 200 and 400. All simulation results are based on 1000 replications. Simulation results are summarized in Tables 1, 2 and 3.
Table 1:
The proportions of Pa and Pj for continuous response with Σ = Σ1
| | | | | CC-SIS | | | | | New (SJS) | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| n | p | ρ | α(·) | P1 | P2 | P3 | P4 | Pa | P1 | P2 | P3 | P4 | Pa |
| 200 | 1000 | 1/3 | α1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| α2 | 0.995 | 1 | 1 | 0.992 | 0.987 | 1 | 1 | 1 | 1 | 1 | |||
| 200 | 1000 | 1/2 | α1 | 1 | 1 | 1 | 0.015 | 0.015 | 1 | 1 | 1 | 1 | 1 |
| α2 | 0.994 | 0.999 | 0.996 | 0.979 | 0.970 | 1 | 1 | 1 | 1 | 1 | | | |
| 200 | 1000 | 2/3 | α1 | 0.995 | 0.997 | 0.995 | 0.302 | 0.297 | 1 | 1 | 0.999 | 1 | 0.999 |
| α2 | 0.976 | 0.995 | 0.984 | 0.942 | 0.909 | 1 | 1 | 1 | 1 | 1 | |||
| 200 | 2000 | 1/3 | α1 | 1 | 1 | 1 | 0.001 | 0.001 | 1 | 1 | 1 | 1 | 1 |
| α2 | 0.992 | 0.999 | 0.998 | 0.989 | 0.979 | 1 | 1 | 1 | 1 | 1 | |||
| 200 | 2000 | 1/2 | α1 | 0.999 | 0.997 | 0.998 | 0.008 | 0.008 | 1 | 1 | 1 | 1 | 1 |
| α2 | 0.991 | 0.998 | 0.994 | 0.973 | 0.958 | 1 | 1 | 1 | 1 | 1 | |||
| 200 | 2000 | 2/3 | α1 | 0.989 | 0.987 | 0.985 | 0.284 | 0.274 | 1 | 1 | 0.993 | 1 | 0.993 |
| α2 | 0.974 | 0.999 | 0.976 | 0.932 | 0.892 | 1 | 1 | 1 | 1 | 1 | |||
| 400 | 1000 | 1/3 | α1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 400 | 1000 | 1/2 | α1 | 1 | 1 | 1 | 0.023 | 0.023 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 400 | 1000 | 2/3 | α1 | 1 | 1 | 1 | 0.623 | 0.623 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 400 | 2000 | 1/3 | α1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 400 | 2000 | 1/2 | α1 | 1 | 1 | 1 | 0.011 | 0.011 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 400 | 2000 | 2/3 | α1 | 1 | 1 | 1 | 0.549 | 0.549 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
Table 3:
Computing times (Seconds) and the Number of Iterations for Continuous Response
| Σ1 | Σ2 | |||||||
|---|---|---|---|---|---|---|---|---|
| α1 | α2 | α1 | α2 | |||||
| ρ | Time | Iterations | Time | Iterations | Time | Iterations | Time | Iterations |
| (n, p) = (200, 1000) | ||||||||
| 1/3 | 3.97(0.17) | 10(0) | 4.10(0.36) | 10(0) | 4.13(0.45) | 10(0) | 3.90(0.20) | 10(0) |
| 1/2 | 4.22(0.24) | 10(0) | 5.03(0.87) | 10(0) | 3.98(0.83) | 10(0) | 4.25(0.37) | 10(0) |
| 2/3 | 3.93(0.11) | 10(0) | 4.08(0.83) | 10(0) | 4.25(0.36) | 10(0) | 4.21(0.32) | 10(0) |
| (n, p) = (200, 2000) | ||||||||
| 1/3 | 7.87(0.47) | 10(0) | 7.37(0.63) | 10(0) | 8.04(0.70) | 10(0) | 7.24(0.20) | 10(0) |
| 1/2 | 7.91(0.59) | 10(0) | 8.40(0.53) | 10(0) | 7.98(0.53) | 10(0) | 7.25(0.21) | 10(0) |
| 2/3 | 7.75(0.61) | 10(0) | 7.03(0.64) | 10(0) | 8.05(0.35) | 10(0) | 7.15(0.39) | 10(0) |
| (n, p) = (400, 1000) | ||||||||
| 1/3 | 2.73(0.37) | 5(1) | 2.03(0.3) | 4(1) | 2.98(0.41) | 5(1) | 2.89(0.46) | 5(0) |
| 1/2 | 2.20(0.21) | 4(0) | 1.44(0.10) | 3(0) | 2.91(0.40) | 5(1) | 2.86(0.46) | 5(1) |
| 2/3 | 1.98(0.30) | 4(1) | 1.50(0.22) | 3(0) | 2.42(0.39) | 5(1) | 2.58(0.33) | 5(1) |
| (n, p) = (400, 2000) | ||||||||
| 1/3 | 4.87(0.67) | 5(1) | 3.73(0.47) | 4(0) | 4.87(0.57) | 5(1) | 6.01(0.98) | 5(1) |
| 1/2 | 3.69(0.29) | 4(0) | 3.34(0.55) | 3(0) | 5.97(1.05) | 5(1) | 6.03(0.93) | 5(1) |
| 2/3 | 3.18(0.43) | 4(0) | 2.34(0.68) | 3(0) | 4.67(0.68) | 5(1) | 6.54(1.72) | 5(1) |
Table 1 shows the values of Pa and Pj for the continuous response with Σ = Σ1. Under the design of α1, X4 is jointly dependent but marginally independent of Y. In this setting, a marginal screening procedure fails to identify X4. As shown in Table 1, when such marginal independence exists, CC-SIS is unable to detect X4: its values of P4 and Pa are near zero, as expected. However, our method can identify X4 in this setting, and the corresponding values of P4 and Pa are close to one. Therefore, our new procedure outperforms CC-SIS in the presence of marginal independence. Under the design of α2, there is no predictor that is jointly dependent but marginally independent of Y. Both CC-SIS and the proposed procedure perform very well, as the detection probabilities are close to one. CC-SIS performs better when the sample size increases and the dimensionality decreases, while those factors have less influence on the new procedure than on CC-SIS. Furthermore, the corresponding values of Pa and Pj of our new procedure are closer to one in every case in this setting. In summary, when Σ = Σ1, regardless of whether marginal independence exists, our new procedure outperforms CC-SIS.
Table 2 shows the values of Pa and Pj for the continuous response with Σ = Σ2. In this setting there is no predictor that is jointly dependent but marginally independent of Y. Hence both CC-SIS and the new procedure perform well, as most of the values of Pa are greater than 0.9. Table 2 also indicates that both CC-SIS and our new procedure perform better when the sample size increases and the dimensionality decreases, and that those factors have less effect on our new procedure. For instance, when n = 200, some values of Pa obtained by CC-SIS are less than 0.8, while the corresponding values for the new procedure are close to one. Moreover, Table 2 shows that the new procedure performs better than CC-SIS in every case, which is consistent with our theoretical analysis since our new procedure has the sure screening property. Hence, our new procedure also outperforms CC-SIS in the setting of Σ = Σ2.
Table 2:
The proportions of Pa and Pj for continuous response with Σ = Σ2
| | | | | CC-SIS | | | | | New (SJS) | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| n | p | ρ | α(·) | P1 | P2 | P3 | P4 | Pa | P1 | P2 | P3 | P4 | Pa |
| 200 | 1000 | 1/3 | α1 | 1 | 1 | 1 | 0.644 | 0.644 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 200 | 1000 | 1/2 | α1 | 1 | 1 | 1 | 0.887 | 0.887 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 0.996 | 0.999 | 0.995 | 1 | 1 | 1 | 1 | 1 | |||
| 200 | 1000 | 2/3 | α1 | 1 | 1 | 0.741 | 0.990 | 0.731 | 1 | 1 | 0.952 | 1 | 0.952 |
| α2 | 1 | 0.745 | 0.999 | 1 | 0.744 | 1 | 1 | 0.998 | 1 | 0.998 | |||
| 200 | 2000 | 1/3 | α1 | 1 | 1 | 1 | 0.551 | 0.551 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 200 | 2000 | 1/2 | α1 | 1 | 1 | 0.997 | 0.858 | 0.855 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 0.991 | 0.999 | 1 | 0.990 | 1 | 1 | 1 | 1 | 1 | |||
| 200 | 2000 | 2/3 | α1 | 1 | 1 | 0.678 | 0.991 | 0.669 | 1 | 1 | 0.903 | 1 | 0.903 |
| α2 | 0.999 | 0.693 | 0.999 | 1 | 0.692 | 1 | 1 | 0.996 | 1 | 0.996 | |||
| 400 | 1000 | 1/3 | α1 | 1 | 1 | 1 | 0.982 | 0.982 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 400 | 1000 | 1/2 | α1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 400 | 1000 | 2/3 | α1 | 1 | 1 | 0.993 | 1 | 0.993 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 0.996 | 1 | 1 | 0.996 | 1 | 1 | 1 | 1 | 1 | |||
| 400 | 2000 | 1/3 | α1 | 1 | 1 | 1 | 0.951 | 0.951 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 400 | 2000 | 1/2 | α1 | 1 | 1 | 1 | 0.999 | 0.999 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 400 | 2000 | 2/3 | α1 | 1 | 1 | 0.991 | 1 | 0.991 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 0.986 | 1 | 1 | 0.986 | 1 | 1 | 1 | 1 | 1 | |||
In addition, comparing the two methods for different values of ρ, Tables 1 and 2 show that the performances of both CC-SIS and the new procedure become worse as ρ increases. This is expected because when the predictors are highly correlated, unimportant predictors may be selected due to their strong correlations with the true predictors.
We also examine the computational efficiency and empirical convergence of the proposed algorithm for VCM. Table 3 reports the medians and median absolute deviations (MADs) of the computing time (in seconds) and of the number of iterations over 1000 replications. When p = 1000, most of the median computing times are below 5 seconds and the MADs are quite small; when p = 2000, the computing time increases, but the medians are still mostly below 9 seconds and the MADs remain small. In general, the algorithm converges faster as the sample size increases. As shown in Table 3, the algorithm converges in about 5 iterations when n = 400 and in about 10 iterations when n = 200. These facts indicate that the proposed algorithm is reasonably efficient.
Example 3.2. This example is designed to examine the performance of the proposed procedure for a binary response. Given {U, x}, we generate a binary response with the probability of Y = 1 being p(U, x), defined below:
logit{p(U, x)} = α1(U)X1 + α2(U)X2 + α3(U)X3 + α4(U)X4,    (3.2)
where logit(t) = log{t/(1 − t)} is the logit link used in logistic regression. Model (3.2) implies that αj(·) = 0 for j > 4 and s* = {1, 2, 3, 4}. In this example, the coefficient functions are set to be the same as those in Example 3.1.
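For illustration, a binary response from model (3.2) can be generated as in the short R sketch below; alpha1, …, alpha4 stand for the coefficient functions of Example 3.1, which are not reproduced here.

```r
# Sketch: generate a binary response from the logistic varying coefficient model (3.2).
eta  <- alpha1(U) * x[, 1] + alpha2(U) * x[, 2] + alpha3(U) * x[, 3] + alpha4(U) * x[, 4]
prob <- plogis(eta)                              # inverse logit
Y <- rbinom(length(prob), size = 1, prob = prob)
```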
In this example, we consider p = 1000 and 2000, and the sample size n = 300 and 500. All simulation results are based on 1000 replications, and are summarized in Tables 4 and 5.
Table 4:
The proportions of Pa and Pj for binary response (left block: Σ = Σ1; right block: Σ = Σ2)
| n | p | ρ | α(·) | P1 | P2 | P3 | P4 | Pa | P1 | P2 | P3 | P4 | Pa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 300 | 1000 | 1/3 | α1 | 0.999 | 0.998 | 1 | 1 | 0.997 | 1 | 1 | 0.998 | 0.994 | 0.992 |
| α2 | 0.999 | 1 | 1 | 1 | 0.999 | 1 | 1 | 1 | 1 | 1 | |||
| 300 | 1000 | 1/2 | α1 | 0.983 | 0.987 | 0.987 | 1 | 0.958 | 1 | 1 | 0.984 | 1 | 0.984 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.996 | 1 | 0.996 | |||
| 300 | 1000 | 2/3 | α1 | 0.925 | 0.928 | 0.946 | 1 | 0.813 | 1 | 1 | 0.896 | 0.996 | 0.894 |
| α2 | 0.995 | 1 | 0.996 | 0.994 | 0.988 | 1 | 0.997 | 0.976 | 1 | 0.973 | |||
| 300 | 2000 | 1/3 | α1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.998 | 0.99 | 0.988 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 300 | 2000 | 1/2 | α1 | 0.974 | 0.98 | 0.984 | 1 | 0.941 | 0.998 | 1 | 0.955 | 0.999 | 0.952 |
| α2 | 0.999 | 1 | 1 | 0.998 | 0.997 | 1 | 1 | 0.994 | 1 | 0.994 | |||
| 300 | 2000 | 2/3 | α1 | 0.898 | 0.903 | 0.923 | 1 | 0.75 | 0.998 | 0.999 | 0.821 | 0.994 | 0.816 |
| α2 | 0.991 | 1 | 0.996 | 0.99 | 0.979 | 1 | 0.99 | 0.952 | 1 | 0.943 | |||
| 500 | 1000 | 1/3 | α1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 500 | 1000 | 1/2 | α1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 500 | 1000 | 2/3 | α1 | 0.998 | 0.998 | 0.998 | 1 | 0.994 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 500 | 2000 | 1/3 | α1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 500 | 2000 | 1/2 | α1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 500 | 2000 | 2/3 | α1 | 0.987 | 0.995 | 0.998 | 1 | 0.980 | 1 | 1 | 0.998 | 1 | 0.998 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
Table 5:
Computing times (Seconds) and the number of iterations for binary response
| Σ1 | Σ2 | |||||||
|---|---|---|---|---|---|---|---|---|
| α1 | α2 | α1 | α2 | |||||
| ρ | Time | Iterations | Time | Iterations | Time | Iterations | Time | Iterations |
| (n, p) = (300, 1000) | ||||||||
| 1/3 | 15.65(2.51) | 5(1) | 13.18(2.37) | 4(1) | 12.36(1.69) | 4(1) | 14.52(2.62) | 4(0) |
| 1/2 | 17.39(2.56) | 4(0) | 8.17(0.28) | 3(0) | 14.70(2.39) | 4(1) | 14.48(2.67) | 4(0) |
| 2/3 | 15.44(2.39) | 4(0) | 9.19(1.75) | 3(0) | 14.55(1.98) | 4(1) | 16.76(3.19) | 4(1) |
| (n, p) = (300, 2000) | ||||||||
| 1/3 | 23.63(4.09) | 5(1) | 19.80(3.31) | 4(1) | 17.76(3.55) | 4(1) | 16.93(3.21) | 4(1) |
| 1/2 | 17.70(1.08) | 4(0) | 13.54(0.39) | 3(0) | 22.61(4.13) | 5(1) | 18.79(3.60) | 4(1) |
| 2/3 | 16.94(1.94) | 4(0) | 13.46(0.64) | 3(0) | 22.24(3.89) | 5(1) | 21.50(3.56) | 4(1) |
| (n, p) = (500, 1000) | ||||||||
| 1/3 | 75.23(11.43) | 5(0) | 50.36(8.00) | 4(0) | 55.09(8.95) | 5(1) | 55.03(7.53) | 5(1) |
| 1/2 | 64.40(8.98) | 4(0) | 33.64(3.32) | 3(0) | 62.36(8.52) | 5(1) | 56.10(9.03) | 5(1) |
| 2/3 | 55.52(8.34) | 4(0) | 31.63(3.18) | 3(0) | 63.35(8.16) | 5(1) | 56.07(9.19) | 5(1) |
| (n, p) = (500, 2000) | ||||||||
| 1/3 | 112.07(18.07) | 5(0) | 57.70(4.09) | 4(0) | 70.14(12.46) | 5(1) | 71.20(10.52) | 5(1) |
| 1/2 | 75.85(13.67) | 4(0) | 49.28(7.43) | 3(0) | 69.76(11.67) | 5(1) | 70.23(12.71) | 5(1) |
| 2/3 | 78.53(11.51) | 4(0) | 44.31(3.67) | 3(0) | 79.09(13.66) | 5(1) | 72.74(11.21) | 5(1) |
Table 4 shows the values of Pa and Pj for the binary response. Under the design of Σ1 and α1, X4 is jointly dependent but marginally independent of Y. As shown in Table 4, the values of P4 and Pa are very close to one, which means our method is able to identify the predictor that is jointly important but marginally independent of the response. In general, P4 is the largest because the absolute value of α4(U) is no smaller than those of the other three coefficient functions, which makes X4 easier to identify. When there is no marginal independence, the values of Pa and Pj are also very close to one. From the table, we see that the values of Pa are mostly greater than 0.9. In addition, our procedure performs better as the sample size increases and the dimensionality decreases, which is consistent with the sure screening property of the new method.
Furthermore, comparing the performance of the new procedure under different values of ρ, Table 4 shows that the new procedure performs better as ρ decreases. This is the same pattern as in Example 3.1.
Table 5 presents the medians and MADs of the computing time (in seconds) and of the number of iterations for the binary response over 1000 simulations. In general, the computing time increases as the sample size and the dimension of the predictors increase. The algorithm converges within about 5 iterations, and the number of iterations is hardly influenced by the sample size or the dimension of the predictors. This implies that the proposed algorithm works well for GVCM with binary response.
Example 3.3. This example is designed to examine the performance of the proposed procedure for GVCM with a count response. Given {U, x}, we generate a count response from a Poisson distribution with mean λ(U, x) defined below.
λ(U, x) = exp{α1(U)X1 + α2(U)X2 + α3(U)X3 + α4(U)X4}.    (3.3)
Model (3.3) implies that αj(·) = 0 for j > 4 and s* = {1, 2, 3, 4}. In this example, we consider two sets of coefficient functions:
α1: Let , and .
α2: , , , .
That is, we re-scale the α(·)'s in Example 3.1 so that their ranges lie between −1 and 1, since the mean function λ(U, x) is on the exponential scale of the α(·)'s.
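For illustration, the count response from model (3.3) can be generated as in the sketch below, where alpha1, …, alpha4 again stand for the (rescaled) coefficient functions.

```r
# Sketch: generate a count response from the Poisson varying coefficient model (3.3).
eta <- alpha1(U) * x[, 1] + alpha2(U) * x[, 2] + alpha3(U) * x[, 3] + alpha4(U) * x[, 4]
Y <- rpois(length(eta), lambda = exp(eta))   # Poisson with mean lambda(U, x) = exp(eta)
```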
In this example, we consider p = 1000 and 2000, and the sample size n = 300, and 500. All the simulation results are based on 1000 replications, and are summarized in Tables 6 and 7.
Table 6:
The proportions of Pa and Pj for count response (left block: Σ = Σ1; right block: Σ = Σ2)
| n | p | ρ | α(·) | P1 | P2 | P3 | P4 | Pa | P1 | P2 | P3 | P4 | Pa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 300 | 1000 | 1/3 | α1 | 0.982 | 0.976 | 0.978 | 0.983 | 0.942 | 0.998 | 0.998 | 0.983 | 0.989 | 0.975 |
| α2 | 0.998 | 0.999 | 1 | 0.997 | 0.996 | 1 | 0.998 | 0.998 | 0.998 | 0.995 | |||
| 300 | 1000 | 1/2 | α1 | 0.945 | 0.941 | 0.928 | 0.989 | 0.842 | 0.999 | 1 | 0.884 | 0.994 | 0.883 |
| α2 | 0.982 | 0.988 | 0.994 | 0.98 | 0.95 | 1 | 0.981 | 0.979 | 0.999 | 0.968 | |||
| 300 | 1000 | 2/3 | α1 | 0.815 | 0.848 | 0.808 | 0.979 | 0.554 | 0.993 | 0.998 | 0.622 | 0.994 | 0.617 |
| α2 | 0.866 | 0.917 | 0.894 | 0.852 | 0.626 | 1 | 0.825 | 0.793 | 0.997 | 0.703 | |||
| 300 | 2000 | 1/3 | α1 | 0.965 | 0.966 | 0.956 | 0.973 | 0.895 | 0.998 | 1 | 0.966 | 0.97 | 0.955 |
| α2 | 0.987 | 0.994 | 0.997 | 0.989 | 0.976 | 1 | 0.99 | 0.99 | 0.999 | 0.987 | |||
| 300 | 2000 | 1/2 | α1 | 0.897 | 0.895 | 0.88 | 0.994 | 0.739 | 0.996 | 0.997 | 0.811 | 0.991 | 0.806 |
| α2 | 0.962 | 0.982 | 0.985 | 0.964 | 0.909 | 0.999 | 0.95 | 0.938 | 0.997 | 0.913 | |||
| 300 | 2000 | 2/3 | α1 | 0.744 | 0.743 | 0.748 | 0.986 | 0.421 | 0.992 | 0.99 | 0.489 | 0.988 | 0.479 |
| α2 | 0.811 | 0.879 | 0.858 | 0.806 | 0.534 | 1 | 0.694 | 0.676 | 0.995 | 0.54 | |||
| 500 | 1000 | 1/3 | α1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 500 | 1000 | 1/2 | α1 | 0.999 | 0.999 | 1 | 1 | 0.998 | 0.999 | 1 | 0.991 | 1 | 0.990 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 500 | 1000 | 2/3 | α1 | 0.989 | 0.983 | 0.991 | 1 | 0.965 | 0.999 | 1 | 0.958 | 1 | 0.958 |
| α2 | 0.996 | 1 | 1 | 0.993 | 0.989 | 1 | 0.996 | 0.997 | 1 | 0.994 | |||
| 500 | 2000 | 1/3 | α1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 500 | 2000 | 1/2 | α1 | 0.999 | 1 | 0.999 | 1 | 0.998 | 1 | 1 | 0.988 | 1 | 0.988 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 500 | 2000 | 2/3 | α1 | 0.981 | 0.976 | 0.972 | 1 | 0.933 | 1 | 1 | 0.929 | 1 | 0.929 |
| α2 | 0.988 | 0.995 | 0.996 | 0.994 | 0.974 | 1 | 0.987 | 0.979 | 1 | 0.973 | |||
Table 7:
Computing times (Seconds) and the number of iterations for count response
| Σ1 | Σ2 | |||||||
|---|---|---|---|---|---|---|---|---|
| α1 | α2 | α1 | α2 | |||||
| ρ | Time | Iterations | Time | Iterations | Time | Iterations | Time | Iterations |
| (n, p) = (300, 1000) | ||||||||
| 1/3 | 13.62(2.44) | 4(1) | 11.10(2.10) | 4(1) | 16.17(2.40) | 5(1) | 11.86(2.39) | 4(1) |
| 1/2 | 10.51(2.23) | 4(1) | 12.61(2.03) | 3(1) | 12.90(2.46) | 5(1) | 15.39(2.65) | 5(1) |
| 2/3 | 9.76(0.67) | 3(0) | 11.15(1.51) | 3(0) | 12.84(2.46) | 5(1) | 13.04(2.44) | 5(1) |
| (n, p) = (300, 2000) | ||||||||
| 1/3 | 17.24(3.16) | 4(1) | 18.50(3.96) | 4(1) | 22.47(3.79) | 5(1) | 20.40(3.48) | 5(1) |
| 1/2 | 17.12(3.23) | 4(1) | 16.64(2.84) | 4(1) | 20.38(3.67) | 5(1) | 20.53(3.61) | 5(1) |
| 2/3 | 13.84(0.62) | 3(0) | 13.67(0.51) | 3(0) | 19.84(3.73) | 5(1) | 21.20(3.98) | 5(1) |
| (n, p) = (500, 1000) | ||||||||
| 1/3 | 56.39(9.94) | 4(1) | 43.94(6.90) | 4(1) | 54.58(8.08) | 5(1) | 63.15(9.99) | 5(1) |
| 1/2 | 43.14(6.40) | 4(0) | 39.69(6.17) | 4(1) | 51.78(9.01) | 5(1) | 52.92(8.86) | 5(1) |
| 2/3 | 47.08(7.45) | 4(1) | 29.25(1.14) | 3(0) | 51.12(9.04) | 5(1) | 52.86(8.80) | 5(1) |
| (n, p) = (500, 2000) | ||||||||
| 1/3 | 77.70(11.08) | 4(1) | 53.43(10.93) | 4(1) | 70.14(12.30) | 5(1) | 71.47(12.31) | 5(1) |
| 1/2 | 61.36(8.73) | 4(0) | 52.00(11.15) | 4(1) | 70.80(12.03) | 5(1) | 74.42(10.20) | 5(1) |
| 2/3 | 50.81(11.06) | 4(1) | 50.32(8.40) | 3(0) | 70.83(11.98) | 5(1) | 76.46(11.58) | 6(1) |
Table 6 shows the values of Pa and Pj for the count response. In most cases, the values of Pa and Pj are very close to one, regardless of whether marginal independence exists. In general, the proposed procedure performs better as the sample size increases and the dimensionality decreases. Similar to Examples 3.1 and 3.2, the proposed procedure performs better for smaller values of ρ.
Computing times and the numbers of iterations of the proposed algorithm are summarized in Table 7. Compared with those in Example 3.2 for the binary response, the computing times for the count response are relatively shorter. In general, the computing times become larger as n and p increase. The algorithm converges in fewer steps than in the binary case.
Example 3.4. This example is designed to examine the performance of the HBIC tuning parameter selector. We set n = 500, p = 1000 and 2000, Σ = Σ2 with ρ = 0.5, and α = α2 as the coefficient functions. We set Cn = log(log n) in the HBIC, and compare the performance of the HBIC with those of the AIC and BIC tuning parameter selectors. The following three criteria are used to evaluate the performance:
P: the probability that the true model is selected;
C: the number of correctly selected predictors from four active predictors;
I: the number of predictors incorrectly selected as active ones from all inactive predictors.
The simulation results based on 200 replications are summarized in Table 8.
Table 8:
Comparing AIC, BIC and HBIC (mean and sd)
| Continuous response | Binary response | Count response | |||||
|---|---|---|---|---|---|---|---|
| p=1000 | p=2000 | p=1000 | p=2000 | p=1000 | p=2000 | ||
| AIC | P | 0.100 | 0.060 | 0.055 | 0.020 | 0.420 | 0.370 |
| C | 4(0) | 4(0) | 4(0.100) | 4(0) | 4(0) | 4(0.141) | |
| I | 10.200(7.366) | 9.850(7.262) | 11.425(6.889) | 13.63(6.030) | 1.64(2.242) | 2.030(2.901) | |
| BIC | P | 0.745 | 0.715 | 0.760 | 0.710 | 0.665 | 0.570 |
| C | 4(0) | 4(0) | 4(0.571) | 4(0) | 4(0.262) | 4(0.278) | |
| I | 0.305(0.560) | 0.325(0.549) | 0.300(0.481) | 0.220(0.503) | 0.530(0.956) | 0.720(1.161) | |
| HBIC | P | 0.970 | 0.975 | 0.915 | 0.710 | 0.700 | 0.620 |
| C | 4(0) | 4(0) | 3.73(0.954) | 4(0) | 4(0) | 4(0) | |
| I | 0.030(0.171) | 0.025(0.157) | 0.005(0.171) | 0.320(0.509) | 0.600(1.143) | 0.660(1.002) | |
Table 8 shows that the AIC, BIC and HBIC tuning parameter selectors can reduce model complexity significantly while retaining all active predictors. As seen from Table 8, the HBIC performs much better than the AIC and the BIC in terms of controlling false positives in the linear varying coefficient model. For the HBIC, the probability of obtaining the true model is close to one and the number of false positives is close to zero. For the logistic and Poisson models, the HBIC performs much better than the AIC and the BIC in terms of selecting the true model. The BIC also works well for the logistic and Poisson models, since its probabilities of obtaining the true model are very close to those of the HBIC.
3.2. An application
We illustrate the proposed methodology by an empirical analysis of a subset of data collected in the Framingham Heart Study (FHS, for short). See Dawber, et al. (1951) and Jaquish (2007) for details about the FHS. The data subset consists of data for 977 subjects. Of interest is the impact of dynamic genetic effects on obesity. In our analysis, we focus on nonrare SNPs, that is, SNPs whose minor allele frequency is greater than 0.05. We include 4395 nonrare SNPs with missing rates less than 0.02. According to Wikipedia, a BMI equal to or greater than 25 is considered overweight and above 30 is considered obese. Thus, we define the response variable to be 1 if the subject's BMI is greater than 25 and 0 otherwise, so that the response stands for the status of being overweight or obese. The goal is to identify the SNPs strongly associated with the response. To examine the dynamic (age-dependent) effects of the SNPs and gender on the response, we consider a logistic varying coefficient model with U being age and 8791 covariates, since for each SNP both a dominant effect and an additive effect are considered, in addition to gender as a covariate. This leads to a high-dimensional logistic varying coefficient model with sample size n = 977.
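To fix ideas, the following R sketch shows one common way of constructing the binary response and the additive and dominant SNP codings described above; the data layout (a data frame dat with a bmi column, a gender column and SNP columns coded as minor allele counts 0/1/2) is an assumption made for illustration, not a description of the FHS files.

```r
# Sketch: construct the response and SNP effect codings (assumed 0/1/2 allele counts).
Y <- as.integer(dat$bmi > 25)                    # 1 = overweight or obese, 0 otherwise
snp   <- as.matrix(dat[, snp_cols])              # n x 4395 matrix of minor allele counts
X_add <- snp                                     # additive effect: 0, 1, 2
X_dom <- (snp > 0) * 1                           # dominant effect: carrier indicator
X <- cbind(gender = dat$gender, X_add, X_dom)    # 1 + 2 * 4395 = 8791 covariates
```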
We first apply the proposed screening procedure to the logistic varying coefficient model. Note that the gender variable is not subject to screening. After screening, 29 variables in total are retained.
We further apply the group lasso to the model obtained from the screening procedure, with the HBIC used to select the tuning parameter. The lasso-HBIC selects a model with 20 SNPs. Figure 1 depicts the estimated coefficient functions along with their pointwise confidence intervals for the model selected by the lasso-HBIC. From Figure 1, it can be seen that the intercept function changes over age, and the coefficient functions of some SNPs also change over age, although they hover around zero.
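As an illustration of this refitting step, the sketch below fits a group lasso logistic regression on the B-spline-expanded design of the screened variables using the grpreg package, treating each screened variable's dn spline coefficients as one group; the HBIC-type tuning selection shown here follows the assumed form used earlier and is only a sketch, not the paper's exact implementation.

```r
# Sketch: group lasso refit of the screened logistic varying coefficient model.
library(grpreg)
# Zs: n x (29 * dn) B-spline-expanded design of the 29 screened variables.
grp <- rep(seq_len(29), each = dn)                    # one group per screened variable
fit <- grpreg(Zs, Y, group = grp, penalty = "grLasso", family = "binomial")
eta <- predict(fit, Zs, type = "link")                # linear predictors along the path
dev <- -2 * colSums(Y * eta - log(1 + exp(eta)))      # binomial deviance (up to a constant)
df  <- colSums(as.matrix(fit$beta[-1, ]) != 0)        # nonzero coefficients per lambda
hbic <- dev + df * log(log(length(Y))) * log(ncol(Zs))
best <- which.min(hbic)                               # lambda index selected by HBIC
```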
Figure 1. Estimated coefficient functions for the model selected by the HBIC tuning parameter selector.
4. Discussions
In this work, we propose an SJS feature screening procedure for GVCM with ultrahigh dimensional covariates. The proposed SJS is distinguished from existing SIS procedures in that SJS is based on the joint likelihood of potential candidate features. We propose an effective algorithm to carry out the feature screening procedure and show that the proposed algorithm possesses an ascent property. We study the sampling properties of SJS and establish its sure screening property. We also conduct numerical studies to assess the empirical performance of the proposed procedure. The numerical results imply that the proposed algorithm converges quickly and that its computing time is reasonable.
Acknowledgements
Guangren Yang's research was supported by the National Natural Science Foundation of China grant 11471086, the National Social Science Foundation of China grant 16BTJ032, the Fundamental Research Funds for the Central Universities 15JNQM019, the National Statistical Scientific Research Center Project 2015LD02, Education Bureau of Guangdong Province grant 2016WTSCX007 and Science and Technology Program of Guangzhou grant 2016201604030074. Songshan Yang's research was supported by NIDA, NIH grant P50 DA039838 and NSF grant DMS 1512422. Li's research was supported by NIDA, NIH grants P50 DA039838 and P50 DA036107 and NSF grant DMS 1512422. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIDA or the NIH.
Appendix
Proof of Theorem 1. It follows from the Taylor expansion of the quasi-likelihood function at a point β lying within a neighborhood of γ that
where lies between γ and β. For term, we have
where W (β) is a block diagonal matrix with Wj(β) being a dnj × dnj matrix. Since is non-negative definite, it follows that if
then
Thus it follows that and by the definition of h(γ, β). The solution of is . Hence, under the conditions of Theorem 1, it follows that
The second inequality is due to the fact that , and subject to . By definition of , and . This proves Theorem 1. □
Proof of Theorem 2. For a given model s, a subset of {1, … , p}, let β̂s be the unrestricted maximum quasi-likelihood estimate of βs based on the spline approximation. It suffices to show that
| (A.1) |
as n → ∞. We approximate αj(U) by
| (A.2) |
where , , are basis functions and dn is the number of basis functions, which is allowed to increase with the sample size n.
Let denote all functions that have the form for a given set of basis . For αnj(U), define the approximation error by
Let , and take . Let and . For as s,
where with and , .
For any define . So, we have
where and are two intermediate values. Denote
Thus,
For ∆2, by the Cauchy-Schwarz inequality, we have
According to the property of quasi-likelihood, we have
By Condition (C6) and Corollary 1 in Wei, Huang, and Li (2011), it follows that ∆2 = op(1). Similarly to ∆2, we have ∆3 = op(1).
Next, we consider ∆1. By Wedderburn (1974), the quasi-score function of βs is given by
where µ′(t) is the first-order derivative of µ(t). Let be the Hessian matrix of corresponding to βs.
Under (C3), we consider close to such that for some w1, τ1 > 0. Clearly, when n is sufficiently large, falls into a neighborhood of , so that condition (C6) becomes applicable. Thus, it follows by Condition (C6) and the Cauchy-Schwarz inequality that,
we have
| (A.3) |
where is an intermediate value between and . Thus, we have
where
Let such that , and is bounded by the constant M under Condition (C5). Under Condition (C6), we have maxi . By Condition (C3), we have . These conditions, together with the probability inequality for sums of bounded random variables (Lin and Bai, 2009, page 74), give
| (A.4) |
where . Also, by the same arguments, we have
| (A.5) |
The inequalities (A.4) and (A.5) imply that,
So, under condition (C4), we have
| (A.6) |
By Condition (C6), is concave in , (A.6) holds for any such that .
For any , let be augmented with zeros corresponding to the elements in (i.e. ). By Condition (C1), it is seen that . Consequently,
So, we have shown that
as n → ∞. The theorem is proved. □
Proof of Theorem 3. According to the definition of HBIC, for any model s, HBIC(τ(s)) ≤ HBIC(q) implies that
| (A.7) |
We show that the probability that (A.7) occurs at any goes to 0. For any , let . To consider those near , we have
for some between and . By Condition (C6),
Therefore,
Hence, for any such that , we have
By (A.4), (A.5) and (A.6), we can get
Now let be augmented with zeros corresponding to the elements in . It can be seen that
by (C3). Therefore, uniformly over and with probability tending to 1,
Hence, the probability that (A.7) occurs at any tends to 0 which is (2.13).
On the other hand, for , let k = τ(s) –q. It suffices to consider a fixed k, since k takes only the values 1,…, m − q. By definition, HBIC(τ (s)) ≤ HBIC(q) if and only if
We show that, uniformly in with τ(s) = k + q, this inequality does not occur. For large n, by condition (C6),
where lies between and . Denote , and define
So, we have
This implies that f(Δ) reaches its maximum at . Thus,
Hence, we show that, uniformly over with τ(s) = k + q,
occurs with diminishing probability. Thus, under conditions (C4) and (C6),
by the Markov inequality, for each , we have
Since the number of models in with τ(s) = k + q is no more than p^k, we have shown that
This completes the proof. □
References
- Chen J and Chen Z (2008). Extended Bayesian Information Criterion for Model Selection with Large Model Space. Biometrika, 95, 759–771.
- Chen J and Chen Z (2012). Extended BIC for Small-n-Large-p Sparse GLM. Statistica Sinica, 22, 555–574.
- Cheng M, Honda T, Li J and Peng H (2014). Nonparametric Independence Screening and Structure Identification for Ultra-high Dimensional Longitudinal Data. Annals of Statistics, 42, 1819–1849.
- Cheng M-Y, Honda T and Zhang J-T (2016). Forward Variable Selection for Sparse Ultra-high Dimensional Varying-coefficient Models. Journal of the American Statistical Association, 111, 1209–1221.
- Chu W, Li R and Reimherr M (2016). Feature Screening for Time Varying-coefficient Models with Ultra-high Dimensional Longitudinal Data. Annals of Applied Statistics, 10, 596–617.
- Cleveland WS, Grosse E, and Shyu WM (1992). Local Regression Models. In Statistical Models in S (eds. Chambers JM and Hastie TJ), Pacific Grove, CA: Wadsworth & Brooks/Cole, 309–376.
- Dawber TR, Meadors GF, and Moore FE Jr. (1951). Epidemiological Approaches to Heart Disease: The Framingham Study. American Journal of Public Health, 41, 279–286.
- de Boor C (1978). A Practical Guide to Splines. Springer-Verlag, Berlin.
- Fan J, Feng Y, and Song R (2011). Nonparametric Independence Screening in Sparse Ultra-high Dimensional Additive Models. Journal of the American Statistical Association, 106, 544–557.
- Fan J and Li R (2001). Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association, 96, 1348–1360.
- Fan J and Lv J (2008). Sure Independence Screening for Ultra-high Dimensional Feature Space (with discussion). Journal of the Royal Statistical Society, Series B, 70, 849–911.
- Fan J, Samworth R and Wu Y (2009). Ultrahigh Dimensional Feature Selection: Beyond the Linear Model. Journal of Machine Learning Research, 10, 1829–1853.
- Fan J and Song R (2010). Sure Independence Screening in Generalized Linear Models with NP-Dimensionality. Annals of Statistics, 38, 3567–3604.
- Fan J, Ma Y and Dai W (2014). Nonparametric Independence Screening in Sparse Ultra-High Dimensional Varying-coefficient Models. Journal of the American Statistical Association, 109, 1270–1284.
- Hall P and Miller H (2009). Using Generalized Correlation to Effect Variable Selection in Very High Dimensional Problems. Journal of Computational and Graphical Statistics, 18, 533–550.
- Hastie TJ and Tibshirani RJ (1993). Varying-coefficient Models. Journal of the Royal Statistical Society, Series B, 55, 757–796.
- Jaquish C (2007). The Framingham Heart Study, on Its Way to Becoming the Gold Standard for Cardiovascular Genetic Epidemiology. BMC Medical Genetics, 8, 63.
- Li J and Zhang W (2011). A Semiparametric Threshold Model for Censored Longitudinal Data Analysis. Journal of the American Statistical Association, 106, 685–696.
- Li G, Peng H, Zhang J and Zhu L-X (2012). Robust Rank Correlation Based Screening. Annals of Statistics, 40, 1846–1877.
- Lin ZY and Bai ZD (2009). Probability Inequalities. Science Press, Beijing.
- Liu J, Li R and Wu R (2014). Feature Selection for Varying-coefficient Models with Ultrahigh Dimensional Covariates. Journal of the American Statistical Association, 109, 266–274.
- Liu J, Zhong W and Li R (2015). A Selective Overview of Feature Screening for Ultra-high Dimensional Data. Science in China Series A: Mathematics, 58, 2033–2054.
- McCullagh P and Nelder JA (1989). Generalized Linear Models, Second Edition. Chapman and Hall, London.
- Song R, Yi F, and Zou H (2014). On Varying-coefficient Independence Screening for High Dimensional Varying-coefficient Models. Statistica Sinica, 24, 1735–1752.
- Stone CJ (1982). Optimal Global Rates of Convergence for Nonparametric Regression. Annals of Statistics, 10, 1040–1053.
- Stone CJ (1985). Additive Regression and Other Nonparametric Models. Annals of Statistics, 13, 689–705.
- Tan X, Shiyko M, Li R, Li Y and Dierker L (2012). A Time-varying Effect Model for Intensive Longitudinal Data. Psychological Methods, 17, 61–77.
- Tibshirani R (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
- Wang H (2009). Forward Regression for Ultra-high Dimensional Variable Screening. Journal of the American Statistical Association, 104, 1512–1524.
- Wang L, Kim Y and Li R (2013). Calibrating Nonconvex Penalized Regression in Ultrahigh Dimension. Annals of Statistics, 41, 2505–2536.
- Wei F, Huang J and Li H (2011). Variable Selection and Estimation in High Dimensional Varying-coefficient Models. Statistica Sinica, 21, 1515–1540.
- Wedderburn RWM (1974). Quasi-likelihood Functions, Generalized Linear Models, and the Gauss-Newton Method. Biometrika, 61, 439–447.
- Xia X, Yang H and Li J (2016). Feature Screening for Generalized Varying-coefficient Models with Application to Dichotomous Response. Computational Statistics & Data Analysis, 102, 85–97.
- Xu C and Chen J (2014). The Sparse MLE for Ultra-High Dimensional Feature Screening. Journal of the American Statistical Association, 109, 1257–1269.
- Zhu L-P, Li L, Li R and Zhu L-X (2011). Model-free Feature Screening for Ultra-high Dimensional Data. Journal of the American Statistical Association, 106, 1464–1475.
