Abstract
Generalized varying coefficient models are particularly useful for examining dynamic effects of covariates on a continuous, binary or count response. This paper is concerned with feature screening for generalized varying coefficient models with ultrahigh dimensional covariates. The proposed screening procedure is based on the joint quasi-likelihood of all predictors, and is therefore distinguished from the marginal screening procedures proposed in the literature. In particular, the proposed procedure can effectively identify active predictors that are jointly dependent but marginally independent of the response. To carry out the proposed procedure, we develop an effective algorithm and establish its ascent property. We further prove that the proposed procedure possesses the sure screening property: with probability tending to one, the selected variable set includes the actual active predictors. We examine the finite sample performance of the proposed procedure, compare it with existing ones via Monte Carlo simulations, and illustrate it with a real data example.
Keywords: Generalized varying-coefficient models, ultrahigh dimensional data, variable screening
1. Introduction
Generalized linear models have been well studied in the literature. Variable selection via penalized likelihood has been developed for generalized linear models with large dimensional covariates (Tibshirani, 1996; Fan and Li, 2001). Ultrahigh dimensional data have been collected in various research areas such as genome-wide association studies, proteomics studies, finance, tumor classification and biomedical imaging. Variable selection methods based on penalized likelihood may not perform well for ultrahigh dimensional data because of challenges in algorithmic stability, computational cost and statistical accuracy (Fan, et al., 2009). Fan and Lv (2008) advocated a two-stage approach: (a) reduce the ultrahigh dimensional covariates to a large but manageable dimension by filtering out a large number of irrelevant covariates with a marginal screening procedure, and (b) apply variable selection methods to the reduced model with large dimensional covariates. Fan and Lv (2008) proposed a sure independence screening (SIS) procedure for linear models using the Pearson correlation coefficient as the marginal utility and further established the sure screening property of their procedure under the Gaussian linear model framework. Hall and Miller (2009) proposed a feature screening procedure for transformation linear models by using generalized correlation, and Li, et al. (2012) advocated using rank correlation for screening to deal with heavy-tailed distributions and the presence of outliers. Fan, et al. (2009) proposed an SIS procedure for generalized linear models based on marginal likelihood estimates. More details about these marginal feature screening procedures can be found in the recent review paper on feature screening by Liu, et al. (2015).
Varying coefficient models (VCM) were proposed to deal with the “curse of dimensionality” (Cleveland, et al., 1992; Hastie and Tibshirani, 1993). As a natural extension of linear regression models that allows coefficients to vary with a variable such as age or time, the VCM is particularly useful for exploring dynamic patterns of effects and has been used in various research fields (see, e.g., Zhu, et al., 2011; Tan, et al., 2012; Liu, et al., 2014). Feature screening procedures for VCM with ultrahigh dimensional covariates (referred to as ultrahigh dimensional VCM for short) have been proposed in the literature. Liu, et al. (2014) developed an SIS procedure for ultrahigh dimensional VCM by taking conditional Pearson correlation coefficients as the marginal utility for ranking the importance of predictors. Fan, et al. (2014) proposed an SIS procedure for ultrahigh dimensional VCM by extending the B-spline techniques of Fan, et al. (2011) for additive models. Xia, et al. (2016) further extended the SIS procedure of Fan, et al. (2014) to generalized varying coefficient models (GVCM). Cheng, et al. (2016) proposed a forward variable selection procedure for ultrahigh dimensional VCM based on techniques related to B-spline regression and grouped variable selection. Song, et al. (2014) extended the proposal of Fan, et al. (2014) to longitudinal data without taking into account within-subject correlation, while Chu, et al. (2016) proposed an SIS procedure for longitudinal data based on a weighted residual sum of squares that uses within-subject correlation to improve the accuracy of feature screening. Although feature screening for ultrahigh dimensional VCM is an active research topic, there is little work on joint feature screening for ultrahigh dimensional GVCM, which are particularly useful for examining dynamic effects of covariates on a binary, count or continuous response. For example, Li and Zhang (2011) proposed a new semiparametric threshold model for censored longitudinal data analysis, and Cheng, et al. (2014) offered an automatic procedure for finding a sparse semivarying coefficient model, which is widely used for longitudinal data analysis. This paper intends to fill this gap.
In this paper, we propose a new feature screening procedure for ultrahigh dimensional GVCM. The proposed procedure is based on the joint likelihood of potential active predictors and therefore is distinguished from the existing SIS procedures (Fan, et al., 2014; Liu, et al., 2014; Xia, et al., 2016) in that it is not a marginal screening procedure. Wang (2009) proposed a forward regression approach to feature screening in ultrahigh dimensional linear models. Cheng, et al. (2016) further extended the forward regression procedure to ultrahigh dimensional VCM based on techniques related to B-spline regression and grouped variable selection. Xu and Chen (2014) proposed a feature screening procedure for generalized linear models via the sparsity-restricted maximum likelihood estimator. As demonstrated in Wang (2009), Xu and Chen (2014) and Cheng, et al. (2016), their approaches can perform better than sure independence screening procedures and can effectively identify predictors that are jointly dependent but marginally independent of the response. In this paper, we develop a new screening procedure for the ultrahigh dimensional GVCM based on the joint likelihood of potential active predictors. The proposed procedure can effectively identify active predictors that are jointly dependent but marginally independent of the response without performing an iterative procedure. We develop a computationally effective algorithm to carry out the proposed procedure and establish the ascent property of the proposed algorithm. We further prove that the proposed procedure possesses the sure screening property. That is, with probability tending to one, the selected variable set includes the actual active predictors. In summary, this work makes the following major contributions to the literature. (a) We propose a sure joint screening (SJS) procedure for ultrahigh dimensional GVCM. We further propose an effective algorithm to carry out the proposed screening procedure, and demonstrate the ascent property of the proposed algorithm. (b) We establish the sure screening property for the proposed joint screening procedure.
The rest of this paper is organized as follows. In Section 2, we propose a new feature screening procedure for the ultrahigh dimensional GVCM and develop an effective algorithm to carry it out. We further study theoretical properties of the proposed procedure and algorithm. In Section 3, we present numerical comparisons and an empirical analysis of a real data example. Discussion and concluding remarks are given in Section 4. Technical proofs are given in the Appendix.
2. New feature screening procedure for generalized varying coefficient models
Let Y be the response variable and {x, U} its associated covariates, where x = (X1, · · · , Xp)T is a p-dimensional covariate vector and U is a univariate covariate. Further, let µ(x, U) = E(Y |x, U). The GVCM assumes that
g{µ(x, U)} = xTα(U) = α1(U)X1 + · · · + αp(U)Xp,    (2.1)
where g(·) is a known link function and α(·) = (α1(·), · · · , αp(·))T is a vector of unspecified smooth regression coefficient functions. Here it is assumed that all αj(·)’s are nonparametric functions and that the support of U is bounded and denoted by [a, b].
Suppose that {Ui, xi, Yi}, i = 1, … , n, constitute an independent and identically distributed sample and that, conditionally on {Ui, xi}, the conditional quasi-likelihood of Yi is Q{µ(Ui, xi), Yi}, where the quasi-likelihood function is defined by ∂Q(µ, y)/∂µ = (y − µ)/V(µ), or equivalently Q(µ, y) = ∫_y^µ (y − s)/V(s) ds, for a specific variance function V(s). Denote by ℓ{α(·)} the quasi-likelihood (McCullagh and Nelder, 1989) of the collected data {(Ui, xi, Yi), i = 1, … , n}. That is,
ℓ{α(·)} = Σ_{i=1}^n Q[g⁻¹{xiTα(Ui)}, Yi].    (2.2)
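For a concrete instance of the quasi-likelihood, consider a count response with variance function V(s) = s. Then Q(µ, y) = ∫_y^µ (y − s)/s ds = y log(µ/y) − (µ − y), which equals the Poisson log-likelihood up to a term not involving µ; with the canonical link g(µ) = log µ, model (2.1) becomes a Poisson varying coefficient model of the type used in Example 3.3 below.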
To estimate the nonparametric regression coefficients, we use the B-spline regression method. Let Sn be the space of polynomial splines of a given degree on [a, b], and let ψj(u) = (ψj1(u), · · · , ψjdnj(u))T denote a normalized B-spline basis of Sn with ‖ψjk‖∞ ≤ 1, where ‖ · ‖∞ is the sup norm. For any αnj(u) ∈ Sn, we have
αnj(u) = Σ_{k=1}^{dnj} βjk ψjk(u) = ψj(u)Tβj    (2.3)
for some coefficients βj = (βj1, · · · , βjdnj)T. Here dnj increases with n. We allow dnj to be different for different j since different coefficient functions may have different smoothness. Under some conditions, each nonparametric coefficient function αj(U), j = 1, · · · , p, can be well approximated by functions in Sn.
Substituting (2.3) into (2.2), the maximum quasi-likelihood estimate is obtained by maximizing
ℓ(β) = Σ_{i=1}^n Q[g⁻¹(ziTβ), Yi]    (2.4)
with respect to β, where zi = (Xi1ψ1(Ui)T , · · · , Xipψp(Ui)T)T and β = (β1T, · · · , βpT)T. With a slight abuse of notation, we use ℓ{α(·)} in (2.2) and ℓ(β) in (2.4); the meaning will be clear from the context. In the presence of an ultrahigh dimensional covariate x, the corresponding optimization problem becomes ill-posed. It is typical to assume sparsity; that is, only a few x-covariates are significant, and the others have no impact on the response. We next propose a feature screening procedure for model (2.1).
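To make the construction of zi concrete, the following R sketch builds the B-spline-expanded design matrix Z whose i-th row is ziT, using a cubic B-spline basis from the splines package; using a common number of basis functions dn for all j is a simplification made for illustration only.

```r
# Sketch: build the B-spline-expanded design matrix Z with rows z_i^T.
library(splines)
dn  <- 6                                  # basis functions per coefficient (assumed common)
Psi <- bs(U, df = dn, degree = 3)         # n x dn cubic B-spline basis evaluated at U_i
# z_i stacks X_{ij} * psi(U_i) over j = 1, ..., p, so Z is n x (p * dn)
Z <- do.call(cbind, lapply(seq_len(ncol(x)), function(j) x[, j] * Psi))
```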
2.1. A new feature screening procedure
Denote by ‖αj‖2 = {∫_a^b αj(u)² du}^{1/2} the L2-norm of αj(U). For ease of presentation, s denotes an arbitrary subset of {1, · · · , p}, and xs = {Xj, j ∈ s} denotes the corresponding subvector of x. For a set s, τ(s) stands for the cardinality of s. Suppose the effect of x is sparse, let α*(U) denote the true value of α(U), and let β* denote the corresponding true value of β. Denote s* = {j : ‖αj*‖2 ≠ 0}. By sparsity, we mean that τ(s*) is much less than p. The goal of feature screening is to identify a subset s such that s* ⊆ s with overwhelming probability and τ(s) is also much less than p. Theoretically we may formulate this problem as the following optimization problem:
max over α(·) of ℓ{α(·)} subject to Σ_{j=1}^p I(‖αj‖2 ≠ 0) ≤ m,    (2.5)
for a pre-specified m, which is presumed to be much less than p.
When the approximation error is negligible, we construct a feature screening procedure by considering the following maximization problem:
max over β of ℓ(β) subject to Σ_{j=1}^p I(‖ψj(·)Tβj‖2 ≠ 0) ≤ m.    (2.6)
Note that ‖ψj(·)Tβj‖2² = βjT{∫_a^b ψj(u)ψj(u)T du}βj. Under the assumption that ∫_a^b ψj(u)ψj(u)T du is finite and positive definite for all j = 1, · · · , p, the maximization problem in (2.6) is equivalent to
max over β of ℓ(β) subject to Σ_{j=1}^p I(‖βj‖ ≠ 0) ≤ m.    (2.7)
For high dimensional problems, it becomes almost impossible to solve the constrained maximization problem (2.7) directly. Alternatively, we consider a proxy of the quasi-likelihood function. It follows from the Taylor expansion of the quasi-likelihood function ℓ(γ) at a point β lying within a neighborhood of γ that
ℓ(γ) ≈ ℓ(β) + (γ − β)Tℓ′(β) + (1/2)(γ − β)Tℓ″(β)(γ − β),
where ℓ′(β) and ℓ″(β) are the gradient vector and the Hessian matrix of ℓ(·) evaluated at β. If ℓ″(β) is invertible, the computational complexity of calculating its inverse is of cubic order in the dimension of β. For large p, small n problems, ℓ″(β) becomes not invertible. Low computational cost is always desirable for feature screening. To cope with the singularity of the Hessian matrix and to save computational cost, we propose using the following approximation for ℓ″(γ):
ℓ″(γ) ≈ −uW(β),    (2.8)
where u is a scaling constant to be specified and W(β) = diag{W1(β), · · · , Wp(β)} is a block diagonal matrix with Wj(β) being a dnj × dnj matrix. Here we allow W(β) to depend on β. This implies that we approximate ℓ″(β) by −uW(β).
Combining the Taylor expansion with this approximation, define the surrogate function h(γ|β) = ℓ(β) + (γ − β)Tℓ′(β) − (u/2)(γ − β)TW(β)(γ − β). It can be seen that h(β|β) = ℓ(β) and, under some conditions, h(γ|β) ≤ ℓ(γ) for all γ. This ensures the ascent property; see Theorem 1 below for more details. Since W(β) is a block diagonal matrix, h(γ|β) is an additive function of the γj for any given β. The additivity enables us to obtain a closed form solution for the following maximization problem
max over γ of h(γ|β) subject to Σ_{j=1}^p I(‖γj‖ ≠ 0) ≤ m    (2.9)
for given β and m. Define γ̃j = βj + u⁻¹Wj(β)⁻¹ℓ′j(β) for j = 1, · · · , p, where ℓ′j(β) is the subvector of ℓ′(β) corresponding to βj; then γ̃ = (γ̃1T, · · · , γ̃pT)T is the unconstrained maximizer of h(γ|β). Denote gj = γ̃jTWj(β)γ̃j for j = 1, · · · , p, and sort the gj so that g(1) ≥ g(2) ≥ · · · ≥ g(p). The solution of the maximization problem (2.9) is the hard-thresholding rule defined below: γ̂j = γ̃j if gj ≥ g(m), and γ̂j = 0 otherwise.
This enables us to effectively screen features by using the following algorithm.
Step 1. Set the initial value β(0).
Step 2. For t = 0, 1, 2, · · ·, iteratively conduct Steps 2a and 2b below until the algorithm converges.
Step 2a. Calculate γ̃j(t) = βj(t) + ut⁻¹Wj(β(t))⁻¹ℓ′j(β(t)) and gj(t) = {γ̃j(t)}TWj(β(t))γ̃j(t), j = 1, · · · , p. Let g(1)(t) ≥ · · · ≥ g(p)(t) be the order statistics of the gj(t). Set St = {j : gj(t) ≥ g(m)(t)}, the nonzero index set.
Step 2b. Update β by β(t+1) as follows. If ℓ(γ̂(t)) > ℓ(β(t)), set β(t+1) = γ̂(t), where γ̂(t) is the hard-thresholded vector with γ̂j(t) = γ̃j(t)I(j ∈ St); otherwise, set β(t+1) to be the maximum quasi-likelihood estimate of the submodel St.
Remark: Unlike screening procedures based on marginal (quasi-)likelihood methods, our proposed procedure iteratively updates β in Step 2. This enables the proposed screening procedure to incorporate correlation information among the predictors through updating γ̃(t) and the gj(t). Thus, the proposed procedure is expected to perform better than marginal screening procedures when some predictors are marginally independent of the response but jointly dependent. Meanwhile, each iteration in Step 2 avoids large-scale matrix inversion and therefore can be carried out at low computational cost.
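To make the iteration concrete, the following R sketch implements one plausible version of Steps 1–2 for the identity-link (least-squares) case, taking W(β) as the diagonal of ZTZ and u as the largest eigenvalue of the sample correlation matrix of the columns of Z, as suggested after Theorem 1 below; the function name sjs_screen and the simplified refit in Step 2b are our own choices for illustration, not the paper's exact specification.

```r
# Minimal sketch of the SJS iteration, identity link (assumptions noted above).
# Z: n x (p*dn) B-spline-expanded design; y: response; dn: basis functions per
# coefficient; m: number of groups to retain.
sjs_screen <- function(Z, y, dn, m, max_iter = 50, tol = 1e-6) {
  p <- ncol(Z) / dn
  group_id <- rep(seq_len(p), each = dn)
  u <- max(eigen(cor(Z), symmetric = TRUE, only.values = TRUE)$values)
  w <- colSums(Z^2)                             # diagonal of Z'Z, used as W(beta)
  loglik <- function(b) -0.5 * sum((y - Z %*% b)^2)
  beta <- rep(0, ncol(Z))
  for (t in seq_len(max_iter)) {
    score <- drop(crossprod(Z, y - Z %*% beta)) # l'(beta)
    gamma <- beta + score / (u * w)             # unconstrained maximizer of h(.|beta)
    g <- tapply(w * gamma^2, group_id, sum)     # group importance g_j
    keep <- sort(order(g, decreasing = TRUE)[seq_len(m)])
    active <- which(group_id %in% keep)
    beta_new <- rep(0, ncol(Z))
    beta_new[active] <- qr.solve(Z[, active], y)  # refit selected submodel (Step 2b)
    if (abs(loglik(beta_new) - loglik(beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  list(selected = keep, beta = beta)
}
```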
Theorem 1. Let {β(t)} be the sequence defined in Step 2b in the above algorithm. Denote
Here and hereafter λmax(A) and λmin(A) stand for the maximal and the minimal eigenvalues of a matrix A, respectively. If ut ≥ ρ(t), then
ℓ(β(t+1)) ≥ ℓ(β(t)),
where β(t+1) is defined in Step 2b of the above algorithm.
Theorem 1 establishes the ascent property of the proposed algorithm when ut is appropriately chosen. That is, the proposed algorithm may improve the current estimate within the feasible region (i.e., models with at most m selected predictors), and the resulting estimate in the current step may serve as a refinement of that from the last step. This theorem also provides some insight into how to choose ut in practical implementation. For varying coefficient models with the identity link, Yi = ziTβ + εi, we may set W(β) = diag(ZTZ). In this case ℓ(β) in (2.4) is −(1/2)Σ_{i=1}^n (Yi − ziTβ)², so that −ℓ″(β) = ZTZ, where Z is the n × Σj dnj matrix with i-th row ziT. Thus,
ρ(t) = λmax{diag(ZTZ)−1/2(ZTZ)diag(ZTZ)−1/2} ≡ ρ,
which does not depend on the step of iteration t. If the zi's are marginally standardized so that their marginal sample means and sample standard deviations equal 0 and 1, respectively, then diag(ZTZ)−1/2(ZTZ)diag(ZTZ)−1/2 is the corresponding sample correlation matrix of the zi's. Thus, ρ is the largest eigenvalue of the sample correlation matrix.
2.2. Sure screening property
For a subset s of {1, … , p} with size τ(s), recall the notation xs = {Xj, j ∈ s} and the associated coefficients αs(U) = {αj(U), j ∈ s} corresponding to βs = {βj, j ∈ s}. We denote the true model by s* = {j : ‖αj*‖2 ≠ 0} with q = τ(s*). The objective of feature selection is to obtain a subset ŝ such that s* ⊆ ŝ with very high probability.
We now provide some theoretical justification for the screening procedure for the GVCM. The sure screening property (Fan and Lv, 2008) is referred to as
P(s* ⊆ ŝ) → 1, as n → ∞.    (2.10)
To establish this sure screening property for the proposed feature screening method, we introduce some additional notation. For any model s, let ℓ′s(βs) and ℓ″s(βs) be the score function and the Hessian matrix of ℓ(·) as a function of βs, respectively. Assume that a screening procedure retains m out of p features such that τ(s*) = q < m. We then define
𝒪 = {s : s* ⊂ s, τ(s) ≤ m} and 𝒰 = {s : s* ⊄ s, τ(s) ≤ m}    (2.11)
as the collections of the over-fitted models and the under-fitted models, respectively. We investigate the asymptotic properties of ŝ under the scenario where p, q, m and β* are allowed to depend on the sample size n. We impose the following conditions, some of which are purely technical and serve only to facilitate theoretical understanding of the proposed feature screening procedure.
(C1) The support of U is bounded and is assumed to be [a, b].
(C2) The coefficient functions αj(·) belong to a class of functions ℱ whose rth derivative exists and is Lipschitz of order η; that is, |f(r)(s) − f(r)(t)| ≤ K|s − t|^η for any s, t ∈ [a, b] and f ∈ ℱ,
for some positive constant K, where r is a nonnegative integer and η ∈ (0, 1] such that υ = r + η > 0.5.
(C3) There exist w1, w2 > 0 and non-negative constants τ1, τ2 with τ1 + τ2 < 1/2 such that τ(s*) ≤ w2 n^{τ2} and min over j ∈ s* of ‖αj*‖2 ≥ w1 n^{−τ1}.
(C4) log p = O(nκ) for some 0 ≤ κ < 1 − 2(τ1 + τ2).
(C5) µ′(·)/V (·) is bounded by some constant M > 0.
(C6) There exist constants C1, C2 > 0, δ > 0, such that for sufficiently large n,
for and , where λmin[ ] and λmax[ ] denote the smallest and largest eigenvalues of a matrix.
Under Conditions (C1) and (C2), the following two properties of B-splines are valid.
(a) (de Boor, 1978) For any αnj ∈ Sn written as αnj(u) = ψj(u)Tβj, the normalized basis functions satisfy ψjk(u) ≥ 0 and Σ_k ψjk(u) = 1 for all u ∈ [a, b]. In addition, there exist positive constants C3 and C4 such that C3 dnj⁻¹‖βj‖² ≤ ∫_a^b {ψj(u)Tβj}² du ≤ C4 dnj⁻¹‖βj‖².
(b) (Stone, 1982, 1985) If {αj, j = 1, 2, · · · , p} is a set of functions in the class ℱ described in Condition (C2), there exists a positive constant C5, not depending on the αj(U), such that the uniform approximation error satisfies sup over u ∈ [a, b] of |αj(u) − αnj(u)| ≤ C5 dnj^{−υ}, j = 1, · · · , p, as n → ∞.
Conditions (C1) and (C2) ensure properties (a) and (b), which are required for the B-spline approximation and establishing the sure screening properties.
Note that, based on properties (a) and (b) and Condition (C3), we can derive that
Condition (C3) states a few requirements for establishing the sure screening property of the proposed procedure. The first is the sparsity of β*, which makes sure screening possible with τ(s*) much smaller than n. Condition (C3) also requires that the signal of the active components does not vanish; this is referred to as the minimal signal condition in the literature. The minimal signal condition is a commonly imposed assumption in existing work on marginal feature screening for other models (e.g., Liu, et al., 2014). By (2.12), it is equivalent to requiring that the minimal component of β* does not degenerate too fast, so that the signal is detectable in the asymptotic sequence. Condition (C4) allows p to diverge with n up to an exponential rate. Meanwhile, together with (C6), it confines an appropriate order of m that guarantees the identifiability of s* over s for τ(s) ≤ m. For the varying coefficient model discussed in Section 2.1, Condition (C6) requires
where Zs is the corresponding design matrix of model s. We establish the sure screening property of the quasi-likelihood estimation in the following theorem. In Fan and Song (2010), their Condition D ensures that the tail of the response variable Y is exponentially light (see Lemma 1 of Fan and Song, 2010). Since Condition D corresponds to our Condition (C6), Condition (C6) likewise ensures that the tail of Y is well behaved.
Remark: Our proposed screening procedure is based on the joint quasi-likelihood of all predictors, whereas Fan, Ma and Dai (2014) investigated marginal nonparametric screening methods for sparse ultrahigh dimensional varying coefficient models. Conditions (v) and (vi) in Fan, Ma and Dai (2014) are requirements on the tail distributions of the covariates and of the noise that are used to establish their sure screening property; there, the errors need to be independent but not necessarily normally distributed. In contrast, our corresponding Condition (C6) only assumes that the minimum and maximum eigenvalues of the Hessian matrix are bounded.
Theorem 2. Suppose we have n independent observations with p candidate features from model (2.1) and that Conditions (C1)–(C6) are satisfied. Let ŝ be the set of features of size m obtained by (2.5). Then, we have
P(s* ⊆ ŝ) → 1, as n → ∞.
The proof is given in the Appendix. The sure screening property is an appealing property of a screening procedure since it ensures that the true active predictors are retained in the model selected by the screening procedure. We establish the sure screening property under conditions weaker than those imposed in Fan, et al. (2014) and Xia, et al. (2016).
One has to specify the value of m in practical implementation. As to the choice of m, there are two scenarios. The first chooses m by the data-driven method described in Section 2.3. The second is an ad hoc choice. In the feature screening literature, it is typical to set m = [n/ log(n)] for a parametric model, where [a] denotes the integer part of a (Fan and Lv, 2008). Since we use a linear combination of dn B-spline basis functions in our proposed screening procedure for the GVCM, we set m = [(n/dn)/ log(n/dn)] throughout Examples 3.1, 3.2 and 3.3. Although it is an ad hoc choice, it works reasonably well in our numerical examples. With this choice of m, one is ready to apply existing methods such as the penalized quasi-likelihood method to further remove inactive predictors. To distinguish it from SIS procedures, the proposed procedure is referred to as the sure joint screening (SJS) procedure.
2.3. Choice of m
Feature screening may be used in various contexts. In some contexts, one may treat m as a pre-specified value. For example, due to budget constraints, a biologist may be able to examine up to m genes that potentially associate with a certain phenotype. In other contexts, one may treat m as a tuning parameter that controls model complexity. In such cases, it is desirable to develop an automatic data-driven method to determine m. We propose to select m by minimizing the high-dimensional BIC (HBIC) score:
where Cn is a sequence of numbers that diverges to ∞. Wang, et al. (2013) proposed the HBIC for selecting the tuning parameter in penalized least squares for high dimensional linear models. Here we modify their proposal for high dimensional generalized varying coefficient models. In our simulations, we take Cn = log log n, and compare its performance with AIC and BIC tuning parameter selectors defined in the same manner. It is worth noting that the proposed HBIC tuning parameter selector only requires searching over m = 1, 2, · · · , [n/dn]. This is distinct from the classical AIC and BIC for best subset selection, which require searching over all subsets. Thus, the proposed tuning parameter selector does not incur expensive computational cost.
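Since the exact form of the score is not reproduced above, the sketch below uses one plausible HBIC-type score, −2 × (maximized quasi-log-likelihood) + τ(s)·dn·Cn·log p with Cn = log log n, evaluated for the identity-link case; the score's exact form and the reuse of sjs_screen from the earlier sketch are assumptions made for illustration.

```r
# Hedged sketch: choose m by minimizing an assumed HBIC-type score.
select_m_hbic <- function(Z, y, dn, m_max) {
  n <- length(y); p <- ncol(Z) / dn; Cn <- log(log(n))
  hbic <- sapply(seq_len(m_max), function(m) {
    keep <- sjs_screen(Z, y, dn, m)$selected          # screened model of size m
    cols <- as.vector(outer(seq_len(dn), (keep - 1) * dn, "+"))
    rss  <- sum(lm.fit(Z[, cols, drop = FALSE], y)$residuals^2)
    n * log(rss / n) + m * dn * Cn * log(p)           # -2*loglik (up to constants) + penalty
  })
  which.min(hbic)
}
```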
Recall the collections of over-fitted and under-fitted models defined in (2.11). Theorem 3 below shows that the HBIC selects the correct model size with probability tending to one.
Theorem 3. Suppose we have n independent observations with p candidate features from model (2.1) and that Conditions (C3)–(C6) are satisfied. Let ŝ be the set of features of size m obtained by (2.4) and (2.7). Then, we have
| (2.13) |
where q = τ (s*), and
| (2.14) |
In Example 3.4, we will examine the performance of the proposed HBIC tuning parameter selector.
3. Numerical studies
In this section, we conduct numerical studies to examine the finite sample performance of the proposed procedures and compare them with existing ones. All simulations are conducted in R. Examples 3.1, 3.2 and 3.3 examine the performance of the proposed screening procedures. Following the feature screening literature (e.g., Fan and Lv, 2008), we set m = [n/ log(n)] in these examples. Example 3.4 examines the performance of the proposed HBIC, where m is determined by minimizing the HBIC score.
3.1. Simulation studies
In our simulation, the covariates U and x are generated as follows: first draw (U*, x)T from a (p+1)-dimensional normal distribution with mean zero and covariance matrix Σ, then set U = Φ(U*), where Φ(·) is the cumulative distribution function of N(0, 1). Thus, U follows a uniform distribution U(0, 1) and is correlated with x, and all the predictors X1, …, Xp are correlated with each other. In our simulation, we consider the following two scenarios for Σ = (σij); an R sketch of this data-generating step is given after the list.
Σ1: Compound symmetric correlation structure: σij = 1 if i = j and ρ otherwise.
Σ2: AR(1) correlation structure: σij = ρ^{|i−j|}.
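The following sketch illustrates the data-generating step for the AR(1) scenario; the values of n, p and ρ are placeholders taken from the range of settings used in the examples.

```r
# Sketch: generate (U, x) as described above, with Sigma the AR(1) correlation matrix.
library(MASS)                                        # for mvrnorm
n <- 200; p <- 1000; rho <- 1/2
Sigma <- rho^abs(outer(1:(p + 1), 1:(p + 1), "-"))   # (p+1) x (p+1) AR(1) correlation
Xall  <- mvrnorm(n, mu = rep(0, p + 1), Sigma = Sigma)
U <- pnorm(Xall[, 1])                                # U = Phi(U*), uniform on (0, 1)
x <- Xall[, -1]                                      # p correlated predictors
```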
In our numerical studies, we set the number of B-spline basis functions to be dn for each coefficient function. We use the following two criteria to assess the performance of the proposed procedure; a small sketch for computing them follows the list.
Pa: The proportion of submodels with size d that contain all the true predictors among 1000 simulations.
Pj: The proportion of submodels with size d that contain Xj among 1000 simulations.
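As a small illustration of these criteria, the following sketch computes Pa and Pj from a collection of screening results; the object names selected (a list of selected index vectors, one per replication) and true_set are hypothetical.

```r
# Sketch: compute P_a and P_j over simulation replications (hypothetical objects).
true_set <- 1:4
Pa <- mean(sapply(selected, function(s) all(true_set %in% s)))
Pj <- sapply(true_set, function(j) mean(sapply(selected, function(s) j %in% s)))
```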
Example 3.1. This example is designed to compare the proposed screening procedure with existing SIS procedures for VCM. Since the proposal of Xia, et al. (2016) under the VCM setting coincides with that of Fan, et al. (2014), which shares the same spirit as that of Liu, et al. (2014), and since Song, et al. (2014) and Chu, et al. (2016) were proposed for longitudinal data, we concentrate on the comparison with the CC-SIS procedure proposed by Liu, et al. (2014). Given {U, x}, we generate a continuous response from
Y = α1(U)X1 + α2(U)X2 + α3(U)X3 + α4(U)X4 + ε,    (3.1)
where ε ∼ N(0, 1). Model (3.1) implies that αj(·) = 0 for j > 4 and s* = {1, 2, 3, 4}. We consider two sets of coefficient functions:
α1: Let , and .
α2: , , , .
In this example, we consider p = 1000 and 2000, and the sample size n = 200 and 400. All simulation results are based on 1000 replications. Simulation results are summarized in Tables 1, 2 and 3.
Table 1:
The proportions of Pa and Pj for continuous response with Σ = Σ1
| | | | | CC-SIS | | | | | New (SJS) | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| n | p | ρ | α(·) | P1 | P2 | P3 | P4 | Pa | P1 | P2 | P3 | P4 | Pa |
| 200 | 1000 | 1/3 | α1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| α2 | 0.995 | 1 | 1 | 0.992 | 0.987 | 1 | 1 | 1 | 1 | 1 | |||
| 200 | 1000 | 1/2 | α1 | 1 | 1 | 1 | 0.015 | 0.015 | 1 | 1 | 1 | 1 | 1 |
| α2 | 0.994 | 0.999 | 0.996 | 0.979 | 0.970 | 1 | 1 | 1 | 1 | 1 | | | |
| 200 | 1000 | 2/3 | α1 | 0.995 | 0.997 | 0.995 | 0.302 | 0.297 | 1 | 1 | 0.999 | 1 | 0.999 |
| α2 | 0.976 | 0.995 | 0.984 | 0.942 | 0.909 | 1 | 1 | 1 | 1 | 1 | |||
| 200 | 2000 | 1/3 | α1 | 1 | 1 | 1 | 0.001 | 0.001 | 1 | 1 | 1 | 1 | 1 |
| α2 | 0.992 | 0.999 | 0.998 | 0.989 | 0.979 | 1 | 1 | 1 | 1 | 1 | |||
| 200 | 2000 | 1/2 | α1 | 0.999 | 0.997 | 0.998 | 0.008 | 0.008 | 1 | 1 | 1 | 1 | 1 |
| α2 | 0.991 | 0.998 | 0.994 | 0.973 | 0.958 | 1 | 1 | 1 | 1 | 1 | |||
| 200 | 2000 | 2/3 | α1 | 0.989 | 0.987 | 0.985 | 0.284 | 0.274 | 1 | 1 | 0.993 | 1 | 0.993 |
| α2 | 0.974 | 0.999 | 0.976 | 0.932 | 0.892 | 1 | 1 | 1 | 1 | 1 | |||
| 400 | 1000 | 1/3 | α1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 400 | 1000 | 1/2 | α1 | 1 | 1 | 1 | 0.023 | 0.023 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 400 | 1000 | 2/3 | α1 | 1 | 1 | 1 | 0.623 | 0.623 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 400 | 2000 | 1/3 | α1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 400 | 2000 | 1/2 | α1 | 1 | 1 | 1 | 0.011 | 0.011 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 400 | 2000 | 2/3 | α1 | 1 | 1 | 1 | 0.549 | 0.549 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
Table 3:
Computing times (Seconds) and the Number of Iterations for Continuous Response
| Σ1 | Σ2 | |||||||
|---|---|---|---|---|---|---|---|---|
| α1 | α2 | α1 | α2 | |||||
| ρ | Time | Iterations | Time | Iterations | Time | Iterations | Time | Iterations |
| (n, p) = (200, 1000) | ||||||||
| 1/3 | 3.97(0.17) | 10(0) | 4.10(0.36) | 10(0) | 4.13(0.45) | 10(0) | 3.90(0.20) | 10(0) |
| 1/2 | 4.22(0.24) | 10(0) | 5.03(0.87) | 10(0) | 3.98(0.83) | 10(0) | 4.25(0.37) | 10(0) |
| 2/3 | 3.93(0.11) | 10(0) | 4.08(0.83) | 10(0) | 4.25(0.36) | 10(0) | 4.21(0.32) | 10(0) |
| (n, p) = (200, 2000) | ||||||||
| 1/3 | 7.87(0.47) | 10(0) | 7.37(0.63) | 10(0) | 8.04(0.70) | 10(0) | 7.24(0.20) | 10(0) |
| 1/2 | 7.91(0.59) | 10(0) | 8.40(0.53) | 10(0) | 7.98(0.53) | 10(0) | 7.25(0.21) | 10(0) |
| 2/3 | 7.75(0.61) | 10(0) | 7.03(0.64) | 10(0) | 8.05(0.35) | 10(0) | 7.15(0.39) | 10(0) |
| (n, p) = (400, 1000) | ||||||||
| 1/3 | 2.73(0.37) | 5(1) | 2.03(0.3) | 4(1) | 2.98(0.41) | 5(1) | 2.89(0.46) | 5(0) |
| 1/2 | 2.20(0.21) | 4(0) | 1.44(0.10) | 3(0) | 2.91(0.40) | 5(1) | 2.86(0.46) | 5(1) |
| 2/3 | 1.98(0.30) | 4(1) | 1.50(0.22) | 3(0) | 2.42(0.39) | 5(1) | 2.58(0.33) | 5(1) |
| (n, p) = (400, 2000) | ||||||||
| 1/3 | 4.87(0.67) | 5(1) | 3.73(0.47) | 4(0) | 4.87(0.57) | 5(1) | 6.01(0.98) | 5(1) |
| 1/2 | 3.69(0.29) | 4(0) | 3.34(0.55) | 3(0) | 5.97(1.05) | 5(1) | 6.03(0.93) | 5(1) |
| 2/3 | 3.18(0.43) | 4(0) | 2.34(0.68) | 3(0) | 4.67(0.68) | 5(1) | 6.54(1.72) | 5(1) |
Table 1 shows the values of Pa and Pj for the continuous response with Σ = Σ1. Under the design of α1, X4 is jointly dependent but marginally independent of Y. In this setting, a marginal screening procedure fails to identify X4. As shown in Table 1, when such marginal independence exists, CC-SIS is unable to detect X4: its values of P4 and Pa are near zero, as expected. However, our method can identify X4 in this setting, and the corresponding values of P4 and Pa are close to one. Therefore, our new procedure outperforms CC-SIS in the presence of marginal independence. Under the design of α2, there is no predictor that is jointly dependent but marginally independent of Y. Both CC-SIS and the proposed procedure perform very well, as the detection probabilities are close to one. CC-SIS performs better when the sample size increases and the dimensionality decreases, while those factors have less influence on the new procedure than on CC-SIS. Furthermore, the corresponding values of Pa and Pj of our new procedure are closer to one in every case in this setting. In summary, when Σ = Σ1, regardless of whether marginal independence exists, our new procedure outperforms CC-SIS.
Table 2 shows the values of Pa and Pj for the continuous response with Σ = Σ2. In this setting there is no predictor that is jointly dependent but marginally independent of Y. Hence both CC-SIS and the new procedure perform well, as most of the values of Pa are greater than 0.9. Table 2 also indicates that both CC-SIS and our new procedure perform better when the sample size increases and the dimensionality decreases, and that those factors have less effect on our new procedure. For instance, when n = 200, some values of Pa obtained by CC-SIS are less than 0.8, while the corresponding values for the new procedure are close to one. Moreover, Table 2 shows that the new procedure performs better than CC-SIS in every case, which is consistent with our theoretical analysis since our new procedure has the sure screening property. Hence, our new procedure also outperforms CC-SIS in the setting of Σ = Σ2.
Table 2:
The proportions of Pa and Pj for continuous response with Σ = Σ2
| | | | | CC-SIS | | | | | New (SJS) | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| n | p | ρ | α(·) | P1 | P2 | P3 | P4 | Pa | P1 | P2 | P3 | P4 | Pa |
| 200 | 1000 | 1/3 | α1 | 1 | 1 | 1 | 0.644 | 0.644 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 200 | 1000 | 1/2 | α1 | 1 | 1 | 1 | 0.887 | 0.887 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 0.996 | 0.999 | 0.995 | 1 | 1 | 1 | 1 | 1 | |||
| 200 | 1000 | 2/3 | α1 | 1 | 1 | 0.741 | 0.990 | 0.731 | 1 | 1 | 0.952 | 1 | 0.952 |
| α2 | 1 | 0.745 | 0.999 | 1 | 0.744 | 1 | 1 | 0.998 | 1 | 0.998 | |||
| 200 | 2000 | 1/3 | α1 | 1 | 1 | 1 | 0.551 | 0.551 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 200 | 2000 | 1/2 | α1 | 1 | 1 | 0.997 | 0.858 | 0.855 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 0.991 | 0.999 | 1 | 0.990 | 1 | 1 | 1 | 1 | 1 | |||
| 200 | 2000 | 2/3 | α1 | 1 | 1 | 0.678 | 0.991 | 0.669 | 1 | 1 | 0.903 | 1 | 0.903 |
| α2 | 0.999 | 0.693 | 0.999 | 1 | 0.692 | 1 | 1 | 0.996 | 1 | 0.996 | |||
| 400 | 1000 | 1/3 | α1 | 1 | 1 | 1 | 0.982 | 0.982 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 400 | 1000 | 1/2 | α1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 400 | 1000 | 2/3 | α1 | 1 | 1 | 0.993 | 1 | 0.993 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 0.996 | 1 | 1 | 0.996 | 1 | 1 | 1 | 1 | 1 | |||
| 400 | 2000 | 1/3 | α1 | 1 | 1 | 1 | 0.951 | 0.951 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 400 | 2000 | 1/2 | α1 | 1 | 1 | 1 | 0.999 | 0.999 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 400 | 2000 | 2/3 | α1 | 1 | 1 | 0.991 | 1 | 0.991 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 0.986 | 1 | 1 | 0.986 | 1 | 1 | 1 | 1 | 1 | |||
In addition, comparing the two methods for different values of ρ, Tables 1 and 2 show that the performances of both CC-SIS and the new procedure become worse as ρ increases. This is expected because when the predictors are highly correlated, unimportant predictors may be selected due to their strong correlations with the true predictors.
We also examine the computational efficiency and empirical convergence of the proposed algorithm for VCM. Table 3 reports the medians and median absolute deviations (MADs) of the computing time (in seconds) and of the number of iterations over 1000 replications. When p = 1000, most of the median computing times are below 5 seconds and the MADs are quite small; when p = 2000, the computing time increases, but the medians are still mostly below 9 seconds and the MADs remain small. In general, the algorithm converges faster as the sample size increases. As shown in Table 3, the algorithm converges in about 5 iterations when n = 400 and in about 10 iterations when n = 200. These facts indicate that the proposed algorithm is reasonably efficient.
Example 3.2. This example is designed to examine the performance of the proposed procedure for a binary response. Given {U, x}, we generate a binary response with the probability of Y = 1 being p(U, x), defined below:
logit{p(U, x)} = α1(U)X1 + α2(U)X2 + α3(U)X3 + α4(U)X4,    (3.2)
where logit(t) = log{t/(1 − t)} is the logit link used in logistic regression. Model (3.2) implies that αj(·) = 0 for j > 4 and s* = {1, 2, 3, 4}. In this example, the coefficient functions are set to be the same as those in Example 3.1.
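For illustration, a binary response from model (3.2) can be generated as in the short R sketch below; alpha1, …, alpha4 stand for the coefficient functions of Example 3.1, which are not reproduced here.

```r
# Sketch: generate a binary response from the logistic varying coefficient model (3.2).
eta  <- alpha1(U) * x[, 1] + alpha2(U) * x[, 2] + alpha3(U) * x[, 3] + alpha4(U) * x[, 4]
prob <- plogis(eta)                              # inverse logit
Y <- rbinom(length(prob), size = 1, prob = prob)
```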
In this example, we consider p = 1000 and 2000, and the sample size n = 300 and 500. All simulation results are based on 1000 replications, and are summarized in Tables 4 and 5.
Table 4:
The proportions of Pa and Pj for binary response (left block: Σ = Σ1; right block: Σ = Σ2)
| n | p | ρ | α(·) | P1 | P2 | P3 | P4 | Pa | P1 | P2 | P3 | P4 | Pa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 300 | 1000 | 1/3 | α1 | 0.999 | 0.998 | 1 | 1 | 0.997 | 1 | 1 | 0.998 | 0.994 | 0.992 |
| α2 | 0.999 | 1 | 1 | 1 | 0.999 | 1 | 1 | 1 | 1 | 1 | |||
| 300 | 1000 | 1/2 | α1 | 0.983 | 0.987 | 0.987 | 1 | 0.958 | 1 | 1 | 0.984 | 1 | 0.984 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.996 | 1 | 0.996 | |||
| 300 | 1000 | 2/3 | α1 | 0.925 | 0.928 | 0.946 | 1 | 0.813 | 1 | 1 | 0.896 | 0.996 | 0.894 |
| α2 | 0.995 | 1 | 0.996 | 0.994 | 0.988 | 1 | 0.997 | 0.976 | 1 | 0.973 | |||
| 300 | 2000 | 1/3 | α1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.998 | 0.99 | 0.988 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 300 | 2000 | 1/2 | α1 | 0.974 | 0.98 | 0.984 | 1 | 0.941 | 0.998 | 1 | 0.955 | 0.999 | 0.952 |
| α2 | 0.999 | 1 | 1 | 0.998 | 0.997 | 1 | 1 | 0.994 | 1 | 0.994 | |||
| 300 | 2000 | 2/3 | α1 | 0.898 | 0.903 | 0.923 | 1 | 0.75 | 0.998 | 0.999 | 0.821 | 0.994 | 0.816 |
| α2 | 0.991 | 1 | 0.996 | 0.99 | 0.979 | 1 | 0.99 | 0.952 | 1 | 0.943 | |||
| 500 | 1000 | 1/3 | α1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 500 | 1000 | 1/2 | α1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 500 | 1000 | 2/3 | α1 | 0.998 | 0.998 | 0.998 | 1 | 0.994 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 500 | 2000 | 1/3 | α1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 500 | 2000 | 1/2 | α1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 500 | 2000 | 2/3 | α1 | 0.987 | 0.995 | 0.998 | 1 | 0.980 | 1 | 1 | 0.998 | 1 | 0.998 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
Table 5:
Computing times (Seconds) and the number of iterations for binary response
| Σ1 | Σ2 | |||||||
|---|---|---|---|---|---|---|---|---|
| α1 | α2 | α1 | α2 | |||||
| ρ | Time | Iterations | Time | Iterations | Time | Iterations | Time | Iterations |
| (n, p) = (300, 1000) | ||||||||
| 1/3 | 15.65(2.51) | 5(1) | 13.18(2.37) | 4(1) | 12.36(1.69) | 4(1) | 14.52(2.62) | 4(0) |
| 1/2 | 17.39(2.56) | 4(0) | 8.17(0.28) | 3(0) | 14.70(2.39) | 4(1) | 14.48(2.67) | 4(0) |
| 2/3 | 15.44(2.39) | 4(0) | 9.19(1.75) | 3(0) | 14.55(1.98) | 4(1) | 16.76(3.19) | 4(1) |
| (n, p) = (300, 2000) | ||||||||
| 1/3 | 23.63(4.09) | 5(1) | 19.80(3.31) | 4(1) | 17.76(3.55) | 4(1) | 16.93(3.21) | 4(1) |
| 1/2 | 17.70(1.08) | 4(0) | 13.54(0.39) | 3(0) | 22.61(4.13) | 5(1) | 18.79(3.60) | 4(1) |
| 2/3 | 16.94(1.94) | 4(0) | 13.46(0.64) | 3(0) | 22.24(3.89) | 5(1) | 21.50(3.56) | 4(1) |
| (n, p) = (500, 1000) | ||||||||
| 1/3 | 75.23(11.43) | 5(0) | 50.36(8.00) | 4(0) | 55.09(8.95) | 5(1) | 55.03(7.53) | 5(1) |
| 1/2 | 64.40(8.98) | 4(0) | 33.64(3.32) | 3(0) | 62.36(8.52) | 5(1) | 56.10(9.03) | 5(1) |
| 2/3 | 55.52(8.34) | 4(0) | 31.63(3.18) | 3(0) | 63.35(8.16) | 5(1) | 56.07(9.19) | 5(1) |
| (n, p) = (500, 2000) | ||||||||
| 1/3 | 112.07(18.07) | 5(0) | 57.70(4.09) | 4(0) | 70.14(12.46) | 5(1) | 71.20(10.52) | 5(1) |
| 1/2 | 75.85(13.67) | 4(0) | 49.28(7.43) | 3(0) | 69.76(11.67) | 5(1) | 70.23(12.71) | 5(1) |
| 2/3 | 78.53(11.51) | 4(0) | 44.31(3.67) | 3(0) | 79.09(13.66) | 5(1) | 72.74(11.21) | 5(1) |
Table 4 shows the values of Pa and Pj for the binary response. Under the design of Σ1 and α1, X4 is jointly dependent but marginally independent of Y. As shown in Table 4, the values of P4 and Pa are very close to one, which means our method is able to identify the predictor that is jointly important but marginally independent of the response. In general, P4 is the largest because the absolute value of α4(U) is no smaller than those of the other three coefficient functions, which makes X4 easier to identify. When there is no marginal independence, the values of Pa and Pj are also very close to one. From the table, we see that the values of Pa are mostly greater than 0.9. In addition, our procedure performs better as the sample size increases and the dimensionality decreases, which is consistent with the sure screening property of the new method.
Furthermore, comparing the performance of the new procedure under different values of ρ, Table 4 shows that the new procedure performs better as ρ decreases. This is the same pattern as in Example 3.1.
Table 5 presents the medians and MADs of the computing time (in seconds) and of the number of iterations for the binary response over 1000 simulations. In general, the computing time increases as the sample size and the dimension of the predictors increase. The algorithm converges within about 5 iterations, and the number of iterations is hardly influenced by the sample size or the dimension of the predictors. This implies that the proposed algorithm works well for GVCM with binary response.
Example 3.3. This example is designed to examine the performance of the proposed procedure for GVCM with a count response. Given {U, x}, we generate a count response from a Poisson distribution with mean λ(U, x) defined below.
λ(U, x) = exp{α1(U)X1 + α2(U)X2 + α3(U)X3 + α4(U)X4}.    (3.3)
Model (3.3) implies that αj(·) = 0 for j > 4 and s* = {1, 2, 3, 4}. In this example, we consider two sets of coefficient functions:
α1: Let , and .
α2: , , , .
That is, we re-scale the α(·)'s in Example 3.1 so that their ranges lie between −1 and 1, since the mean function λ(U, x) is on the exponential scale of the α(·)'s.
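For illustration, the count response from model (3.3) can be generated as in the sketch below, where alpha1, …, alpha4 again stand for the (rescaled) coefficient functions.

```r
# Sketch: generate a count response from the Poisson varying coefficient model (3.3).
eta <- alpha1(U) * x[, 1] + alpha2(U) * x[, 2] + alpha3(U) * x[, 3] + alpha4(U) * x[, 4]
Y <- rpois(length(eta), lambda = exp(eta))   # Poisson with mean lambda(U, x) = exp(eta)
```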
In this example, we consider p = 1000 and 2000, and the sample size n = 300, and 500. All the simulation results are based on 1000 replications, and are summarized in Tables 6 and 7.
Table 6:
The proportions of Pa and Pj for count response (left block: Σ = Σ1; right block: Σ = Σ2)
| n | p | ρ | α(·) | P1 | P2 | P3 | P4 | Pa | P1 | P2 | P3 | P4 | Pa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 300 | 1000 | 1/3 | α1 | 0.982 | 0.976 | 0.978 | 0.983 | 0.942 | 0.998 | 0.998 | 0.983 | 0.989 | 0.975 |
| α2 | 0.998 | 0.999 | 1 | 0.997 | 0.996 | 1 | 0.998 | 0.998 | 0.998 | 0.995 | |||
| 300 | 1000 | 1/2 | α1 | 0.945 | 0.941 | 0.928 | 0.989 | 0.842 | 0.999 | 1 | 0.884 | 0.994 | 0.883 |
| α2 | 0.982 | 0.988 | 0.994 | 0.98 | 0.95 | 1 | 0.981 | 0.979 | 0.999 | 0.968 | |||
| 300 | 1000 | 2/3 | α1 | 0.815 | 0.848 | 0.808 | 0.979 | 0.554 | 0.993 | 0.998 | 0.622 | 0.994 | 0.617 |
| α2 | 0.866 | 0.917 | 0.894 | 0.852 | 0.626 | 1 | 0.825 | 0.793 | 0.997 | 0.703 | |||
| 300 | 2000 | 1/3 | α1 | 0.965 | 0.966 | 0.956 | 0.973 | 0.895 | 0.998 | 1 | 0.966 | 0.97 | 0.955 |
| α2 | 0.987 | 0.994 | 0.997 | 0.989 | 0.976 | 1 | 0.99 | 0.99 | 0.999 | 0.987 | |||
| 300 | 2000 | 1/2 | α1 | 0.897 | 0.895 | 0.88 | 0.994 | 0.739 | 0.996 | 0.997 | 0.811 | 0.991 | 0.806 |
| α2 | 0.962 | 0.982 | 0.985 | 0.964 | 0.909 | 0.999 | 0.95 | 0.938 | 0.997 | 0.913 | |||
| 300 | 2000 | 2/3 | α1 | 0.744 | 0.743 | 0.748 | 0.986 | 0.421 | 0.992 | 0.99 | 0.489 | 0.988 | 0.479 |
| α2 | 0.811 | 0.879 | 0.858 | 0.806 | 0.534 | 1 | 0.694 | 0.676 | 0.995 | 0.54 | |||
| 500 | 1000 | 1/3 | α1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 500 | 1000 | 1/2 | α1 | 0.999 | 0.999 | 1 | 1 | 0.998 | 0.999 | 1 | 0.991 | 1 | 0.990 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 500 | 1000 | 2/3 | α1 | 0.989 | 0.983 | 0.991 | 1 | 0.965 | 0.999 | 1 | 0.958 | 1 | 0.958 |
| α2 | 0.996 | 1 | 1 | 0.993 | 0.989 | 1 | 0.996 | 0.997 | 1 | 0.994 | |||
| 500 | 2000 | 1/3 | α1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 500 | 2000 | 1/2 | α1 | 0.999 | 1 | 0.999 | 1 | 0.998 | 1 | 1 | 0.988 | 1 | 0.988 |
| α2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
| 500 | 2000 | 2/3 | α1 | 0.981 | 0.976 | 0.972 | 1 | 0.933 | 1 | 1 | 0.929 | 1 | 0.929 |
| α2 | 0.988 | 0.995 | 0.996 | 0.994 | 0.974 | 1 | 0.987 | 0.979 | 1 | 0.973 | |||
Table 7:
Computing times (Seconds) and the number of iterations for count response
| Σ1 | Σ2 | |||||||
|---|---|---|---|---|---|---|---|---|
| α1 | α2 | α1 | α2 | |||||
| ρ | Time | Iterations | Time | Iterations | Time | Iterations | Time | Iterations |
| (n, p) = (300, 1000) | ||||||||
| 1/3 | 13.62(2.44) | 4(1) | 11.10(2.10) | 4(1) | 16.17(2.40) | 5(1) | 11.86(2.39) | 4(1) |
| 1/2 | 10.51(2.23) | 4(1) | 12.61(2.03) | 3(1) | 12.90(2.46) | 5(1) | 15.39(2.65) | 5(1) |
| 2/3 | 9.76(0.67) | 3(0) | 11.15(1.51) | 3(0) | 12.84(2.46) | 5(1) | 13.04(2.44) | 5(1) |
| (n, p) = (300, 2000) | ||||||||
| 1/3 | 17.24(3.16) | 4(1) | 18.50(3.96) | 4(1) | 22.47(3.79) | 5(1) | 20.40(3.48) | 5(1) |
| 1/2 | 17.12(3.23) | 4(1) | 16.64(2.84) | 4(1) | 20.38(3.67) | 5(1) | 20.53(3.61) | 5(1) |
| 2/3 | 13.84(0.62) | 3(0) | 13.67(0.51) | 3(0) | 19.84(3.73) | 5(1) | 21.20(3.98) | 5(1) |
| (n, p) = (500, 1000) | ||||||||
| 1/3 | 56.39(9.94) | 4(1) | 43.94(6.90) | 4(1) | 54.58(8.08) | 5(1) | 63.15(9.99) | 5(1) |
| 1/2 | 43.14(6.40) | 4(0) | 39.69(6.17) | 4(1) | 51.78(9.01) | 5(1) | 52.92(8.86) | 5(1) |
| 2/3 | 47.08(7.45) | 4(1) | 29.25(1.14) | 3(0) | 51.12(9.04) | 5(1) | 52.86(8.80) | 5(1) |
| (n, p) = (500, 2000) | ||||||||
| 1/3 | 77.70(11.08) | 4(1) | 53.43(10.93) | 4(1) | 70.14(12.30) | 5(1) | 71.47(12.31) | 5(1) |
| 1/2 | 61.36(8.73) | 4(0) | 52.00(11.15) | 4(1) | 70.80(12.03) | 5(1) | 74.42(10.20) | 5(1) |
| 2/3 | 50.81(11.06) | 4(1) | 50.32(8.40) | 3(0) | 70.83(11.98) | 5(1) | 76.46(11.58) | 6(1) |
Table 6 shows the values of Pa and Pj for the count response. In most cases, the values of Pa and Pj are very close to one, regardless of whether marginal independence exists. In general, the proposed procedure performs better as the sample size increases and the dimensionality decreases. Similar to Examples 3.1 and 3.2, the proposed procedure performs better for smaller values of ρ.
Computing times and the numbers of iterations of the proposed algorithm are summarized in Table 7. Compared with those in Example 3.2 for the binary response, the computing times for the count response are relatively shorter. In general, the computing times become larger as n and p increase. The algorithm converges in fewer steps than in the binary case.
Example 3.4. This example is designed to examine the performance of the HBIC tuning parameter selector. We set n = 500, p = 1000 and 2000, Σ = Σ2 with ρ = 0.5, and α = α2 as the coefficient functions. We set Cn = log(log n) in the HBIC, and compare the performance of the HBIC with those of the AIC and BIC tuning parameter selectors. The following three criteria are used to evaluate the performance:
P: the probability that the true model is selected;
C: the number of correctly selected predictors from four active predictors;
I: the number of predictors incorrectly selected as active ones from all inactive predictors.
The simulation results based on 200 replications are summarized in Table 8.
Table 8:
Comparing AIC, BIC and HBIC (mean and sd)
| Continuous response | Binary response | Count response | |||||
|---|---|---|---|---|---|---|---|
| p=1000 | p=2000 | p=1000 | p=2000 | p=1000 | p=2000 | ||
| AIC | P | 0.100 | 0.060 | 0.055 | 0.020 | 0.420 | 0.370 |
| C | 4(0) | 4(0) | 4(0.100) | 4(0) | 4(0) | 4(0.141) | |
| I | 10.200(7.366) | 9.850(7.262) | 11.425(6.889) | 13.63(6.030) | 1.64(2.242) | 2.030(2.901) | |
| BIC | P | 0.745 | 0.715 | 0.760 | 0.710 | 0.665 | 0.570 |
| C | 4(0) | 4(0) | 4(0.571) | 4(0) | 4(0.262) | 4(0.278) | |
| I | 0.305(0.560) | 0.325(0.549) | 0.300(0.481) | 0.220(0.503) | 0.530(0.956) | 0.720(1.161) | |
| HBIC | P | 0.970 | 0.975 | 0.915 | 0.710 | 0.700 | 0.620 |
| C | 4(0) | 4(0) | 3.73(0.954) | 4(0) | 4(0) | 4(0) | |
| I | 0.030(0.171) | 0.025(0.157) | 0.005(0.171) | 0.320(0.509) | 0.600(1.143) | 0.660(1.002) | |
Table 8 shows that the AIC, BIC and HBIC tuning parameter selectors can reduce model complexity significantly while retaining all active predictors. As seen from Table 8, the HBIC performs much better than the AIC and the BIC in terms of controlling false positives in the linear varying coefficient model. For the HBIC, the probability of obtaining the true model is close to one and the number of false positives is close to zero. For the logistic and Poisson models, the HBIC performs much better than the AIC and the BIC in terms of selecting the true model. The BIC also works well for the logistic and Poisson models, since its probabilities of obtaining the true model are very close to those of the HBIC.
3.2. An application
We illustrate the proposed methodology by an empirical analysis of a subset of data collected in the Framingham Heart Study (FHS, for short). See Dawber, et al. (1951) and Jaquish (2007) for details about the FHS. The data subset consists of data for 977 subjects. Of interest is the impact of dynamic genetic effects on obesity. In our analysis, we focus on nonrare SNPs, that is, SNPs whose minor allele frequency is greater than 0.05. We include 4395 nonrare SNPs with missing rates less than 0.02. According to Wikipedia, a BMI equal to or greater than 25 is considered overweight and above 30 is considered obese. Thus, we define the response variable to be 1 if the subject's BMI is greater than 25 and 0 otherwise, so that the response stands for the status of being overweight or obese. The goal is to identify the SNPs strongly associated with the response. To examine the dynamic (age-dependent) effects of the SNPs and gender on the response, we consider a logistic varying coefficient model with U being age and 8791 covariates, since for each SNP both a dominant effect and an additive effect are considered, in addition to gender as a covariate. This leads to a high-dimensional logistic varying coefficient model with sample size n = 977.
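To fix ideas, the following R sketch shows one common way of constructing the binary response and the additive and dominant SNP codings described above; the data layout (a data frame dat with a bmi column, a gender column and SNP columns coded as minor allele counts 0/1/2) is an assumption made for illustration, not a description of the FHS files.

```r
# Sketch: construct the response and SNP effect codings (assumed 0/1/2 allele counts).
Y <- as.integer(dat$bmi > 25)                    # 1 = overweight or obese, 0 otherwise
snp   <- as.matrix(dat[, snp_cols])              # n x 4395 matrix of minor allele counts
X_add <- snp                                     # additive effect: 0, 1, 2
X_dom <- (snp > 0) * 1                           # dominant effect: carrier indicator
X <- cbind(gender = dat$gender, X_add, X_dom)    # 1 + 2 * 4395 = 8791 covariates
```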
We first apply the proposed screening procedure to the logistic varying coefficient model. Note that the gender variable is not subject to screening. After screening, 29 variables in total are retained.
We further apply the group lasso to the model obtained from the screening procedure, with the HBIC used to select the tuning parameter. The lasso-HBIC selects a model with 20 SNPs. Figure 1 depicts the estimated coefficient functions along with their pointwise confidence intervals for the model selected by the lasso-HBIC. From Figure 1, it can be seen that the intercept function changes over age, and the coefficient functions of some SNPs also change over age, although they hover around zero.
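As an illustration of this refitting step, the sketch below fits a group lasso logistic regression on the B-spline-expanded design of the screened variables using the grpreg package, treating each screened variable's dn spline coefficients as one group; the HBIC-type tuning selection shown here follows the assumed form used earlier and is only a sketch, not the paper's exact implementation.

```r
# Sketch: group lasso refit of the screened logistic varying coefficient model.
library(grpreg)
# Zs: n x (29 * dn) B-spline-expanded design of the 29 screened variables.
grp <- rep(seq_len(29), each = dn)                    # one group per screened variable
fit <- grpreg(Zs, Y, group = grp, penalty = "grLasso", family = "binomial")
eta <- predict(fit, Zs, type = "link")                # linear predictors along the path
dev <- -2 * colSums(Y * eta - log(1 + exp(eta)))      # binomial deviance (up to a constant)
df  <- colSums(as.matrix(fit$beta[-1, ]) != 0)        # nonzero coefficients per lambda
hbic <- dev + df * log(log(length(Y))) * log(ncol(Zs))
best <- which.min(hbic)                               # lambda index selected by HBIC
```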
Figure 1. Estimated coefficient functions for the model selected by the HBIC tuning parameter selector.
4. Discussions
In this work, we propose an SJS feature screening procedure for GVCM with ultrahigh dimensional covariates. The proposed SJS is distinguished from existing SIS procedures in that SJS is based on the joint likelihood of potential candidate features. We propose an effective algorithm to carry out the feature screening procedure and show that the proposed algorithm possesses an ascent property. We study the sampling properties of SJS and establish its sure screening property. We also conduct numerical studies to assess the empirical performance of the proposed procedure. The numerical results imply that the proposed algorithm converges quickly and that its computing time is reasonable.
Acknowledgements
Guangren Yang's research was supported by the National Natural Science Foundation of China grant 11471086, the National Social Science Foundation of China grant 16BTJ032, the Fundamental Research Funds for the Central Universities 15JNQM019, the National Statistical Scientific Research Center Project 2015LD02, Education Bureau of Guangdong Province grant 2016WTSCX007 and Science and Technology Program of Guangzhou grant 2016201604030074. Songshan Yang's research was supported by NIDA, NIH grant P50 DA039838 and NSF grant DMS 1512422. Li's research was supported by NIDA, NIH grants P50 DA039838 and P50 DA036107 and NSF grant DMS 1512422. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIDA or the NIH.
Appendix
Proof of Theorem 1. It follows from the Taylor expansion of the quasi-likelihood function at a point β lying within a neighborhood of γ that
where lies between γ and β. For term, we have
where W (β) is a block diagonal matrix with Wj(β) being a dnj × dnj matrix. Since is non-negative definite, it follows that if
then
Thus it follows that and by the definition of h(γ, β). The solution of is . Hence, under the conditions of Theorem 1, it follows that
The second inequality is due to the fact that , and subject to . By definition of , and . This proves Theorem 1. □
Proof of Theorem 2. For a given model s, a subset of {1, … , p}, let β̂s be the unrestricted maximum quasi-likelihood estimate of βs based on the spline approximation. It suffices to show that
| (A.1) |
as n → ∞. We approximate αj(U) by
| (A.2) |
where , , are basis functions and dn is the number of basis functions, which is allowed to increase with the sample size n.
Let denote all functions that have the form for a given set of basis . For αnj(U), define the approximation error by
Let , and take . Let and . For as s,
where with and , .
For any define . So, we have
where and are two intermediate values. Denote
Thus,
For ∆2, by the Cauchy-Schwarz inequality, we have
According to the property of quasi-likelihood, we have
By Condition (C6) and Corollary 1 in Wei, Huang, and Li (2011), it follows that ∆2 = op(1). Similarly to ∆2, we have ∆3 = op(1).
Next, we consider ∆1. By Wedderburn (1974), the quasi-score function of βs is given by
where µ′(t) is the first-order derivative of µ(t). Let be the Hessian matrix of corresponding to βs.
Under (C3), we consider close to such that for some w1, τ1 > 0. Clearly, when n is sufficiently large, falls into a neighborhood of , so that condition (C6) becomes applicable. Thus, it follows by Condition (C6) and the Cauchy-Schwarz inequality that,
we have
| (A.3) |
where is an intermediate value between and . Thus, we have
where
Let such that , and is bounded by the constant M under Condition (C5). Under Condition (C6), we have maxi . By Condition (C3), we have . These conditions, together with the probability inequality for sums of bounded random variables (Lin and Bai, 2009, page 74), give
| (A.4) |
where . Also, by the same arguments, we have
| (A.5) |
The inequalities (A.4) and (A.5) imply that,
So, under condition (C4), we have
| (A.6) |
By Condition (C6), is concave in , (A.6) holds for any such that .
For any , let be augmented with zeros corresponding to the elements in (i.e. ). By Condition (C1), it is seen that . Consequently,
So, we have shown that
as n → ∞. The theorem is proved. □
Proof of Theorem 3. According to the definition of HBIC, for any model s, HBIC(τ(s)) ≤ HBIC(q) implies that
| (A.7) |
We show that the probability that (A.7) occurs at any goes to 0. For any , let . To consider those near , we have
for some between and . By Condition (C6),
Therefore,
Hence, for any such that , we have
By (A.4), (A.5) and (A.6), we can get
Now let be augmented with zeros corresponding to the elements in . It can be seen that
by (C3). Therefore, uniformly over and with probability tending to 1,
Hence, the probability that (A.7) occurs at any tends to 0 which is (2.13).
On the other hand, for , let k = τ(s) –q. It suffices to consider a fixed k, since k takes only the values 1,…, m − q. By definition, HBIC(τ (s)) ≤ HBIC(q) if and only if
We show that, uniformly in with τ(s) = k + q, this inequality does not occur. For large n, by condition (C6),
where lies between and . Denote , and define
So, we have
This implies that f(Δ) reaches its maximum at . Thus,
Hence, we show that, uniformly over with τ(s) = k + q,
occurs with diminishing probability. Thus, under conditions (C4) and (C6),
by the Markov inequality, for each , we have
Since the number of models in with τ(s) = k + q is no more than p^k, we have shown that
This completes the proof. □
References
- Chen J and Chen Z (2008). Extended Bayesian Information Criterion for Model Selection with Large Model Space. Biometrika, 95, 759–771.
- Chen J and Chen Z (2012). Extended BIC for Small-n-Large-p Sparse GLM. Statistica Sinica, 22, 555–574.
- Cheng M, Honda T, Li J and Peng H (2014). Nonparametric Independence Screening and Structure Identification for Ultra-high Dimensional Longitudinal Data. Annals of Statistics, 42, 1819–1849.
- Cheng M-Y, Honda T and Zhang J-T (2016). Forward Variable Selection for Sparse Ultra-high Dimensional Varying-coefficient Models. Journal of the American Statistical Association, 111, 1209–1221.
- Chu W, Li R and Reimherr M (2016). Feature Screening for Time Varying-coefficient Models with Ultra-high Dimensional Longitudinal Data. Annals of Applied Statistics, 10, 596–617.
- Cleveland WS, Grosse E, and Shyu WM (1992). Local Regression Models. In Statistical Models in S (eds. Chambers JM and Hastie TJ), Pacific Grove, CA: Wadsworth & Brooks/Cole, 309–376.
- Dawber TR, Meadors GF, and Moore FE Jr. (1951). Epidemiological Approaches to Heart Disease: The Framingham Study. American Journal of Public Health, 41, 279–286.
- de Boor C (1978). A Practical Guide to Splines. Springer-Verlag, Berlin.
- Fan J, Feng Y, and Song R (2011). Nonparametric Independence Screening in Sparse Ultra-high Dimensional Additive Models. Journal of the American Statistical Association, 106, 544–557.
- Fan J and Li R (2001). Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association, 96, 1348–1360.
- Fan J and Lv J (2008). Sure Independence Screening for Ultra-high Dimensional Feature Space (with discussion). Journal of the Royal Statistical Society, Series B, 70, 849–911.
- Fan J, Samworth R and Wu Y (2009). Ultrahigh Dimensional Feature Selection: Beyond the Linear Model. Journal of Machine Learning Research, 10, 1829–1853.
- Fan J and Song R (2010). Sure Independence Screening in Generalized Linear Models with NP-Dimensionality. Annals of Statistics, 38, 3567–3604.
- Fan J, Ma Y and Dai W (2014). Nonparametric Independence Screening in Sparse Ultra-High Dimensional Varying-coefficient Models. Journal of the American Statistical Association, 109, 1270–1284.
- Hall P and Miller H (2009). Using Generalized Correlation to Effect Variable Selection in Very High Dimensional Problems. Journal of Computational and Graphical Statistics, 18, 533–550.
- Hastie TJ and Tibshirani RJ (1993). Varying-coefficient Models. Journal of the Royal Statistical Society, Series B, 55, 757–796.
- Jaquish C (2007). The Framingham Heart Study, on Its Way to Becoming the Gold Standard for Cardiovascular Genetic Epidemiology. BMC Medical Genetics, 8, 63.
- Li J and Zhang W (2011). A Semiparametric Threshold Model for Censored Longitudinal Data Analysis. Journal of the American Statistical Association, 106, 685–696.
- Li G, Peng H, Zhang J and Zhu L-X (2012). Robust Rank Correlation Based Screening. Annals of Statistics, 40, 1846–1877.
- Lin ZY and Bai ZD (2009). Probability Inequalities. Science Press, Beijing.
- Liu J, Li R and Wu R (2014). Feature Selection for Varying-coefficient Models with Ultrahigh Dimensional Covariates. Journal of the American Statistical Association, 109, 266–274.
- Liu J, Zhong W and Li R (2015). A Selective Overview of Feature Screening for Ultra-high Dimensional Data. Science in China Series A: Mathematics, 58, 2033–2054.
- McCullagh P and Nelder JA (1989). Generalized Linear Models, Second Edition. Chapman and Hall, London.
- Song R, Yi F, and Zou H (2014). On Varying-coefficient Independence Screening for High Dimensional Varying-coefficient Models. Statistica Sinica, 24, 1735–1752.
- Stone CJ (1982). Optimal Global Rates of Convergence for Nonparametric Regression. Annals of Statistics, 10, 1040–1053.
- Stone CJ (1985). Additive Regression and Other Nonparametric Models. Annals of Statistics, 13, 689–705.
- Tan X, Shiyko M, Li R, Li Y and Dierker L (2012). A Time-varying Effect Model for Intensive Longitudinal Data. Psychological Methods, 17, 61–77.
- Tibshirani R (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
- Wang H (2009). Forward Regression for Ultra-high Dimensional Variable Screening. Journal of the American Statistical Association, 104, 1512–1524.
- Wang L, Kim Y and Li R (2013). Calibrating Nonconvex Penalized Regression in Ultrahigh Dimension. Annals of Statistics, 41, 2505–2536.
- Wei F, Huang J and Li H (2011). Variable Selection and Estimation in High Dimensional Varying-coefficient Models. Statistica Sinica, 21, 1515–1540.
- Wedderburn RWM (1974). Quasi-likelihood Functions, Generalized Linear Models, and the Gauss-Newton Method. Biometrika, 61, 439–447.
- Xia X, Yang H and Li J (2016). Feature Screening for Generalized Varying-coefficient Models with Application to Dichotomous Response. Computational Statistics & Data Analysis, 102, 85–97.
- Xu C and Chen J (2014). The Sparse MLE for Ultra-High Dimensional Feature Screening. Journal of the American Statistical Association, 109, 1257–1269.
- Zhu L-P, Li L, Li R and Zhu L-X (2011). Model-free Feature Screening for Ultra-high Dimensional Data. Journal of the American Statistical Association, 106, 1464–1475.
