Published in final edited form as: J Multivar Anal. 2018 Jun 20;167:18–434. doi: 10.1016/j.jmva.2018.06.005

Variable selection for partially linear models via partial correlation

Jingyuan Liu a, Lejia Lou b, Runze Li c,*

Abstract

The partially linear model (PLM) is a useful semiparametric extension of the linear model that has been well studied in the statistical literature. This paper proposes a variable selection procedure for the PLM with ultrahigh dimensional predictors. The proposed method is different from the existing penalized least squares procedure in that it relies on partial correlation between the partial residuals of the response and the predictors. We systematically study the theoretical properties of the proposed procedure and prove its model consistency property. We further establish the root-n convergence of the estimator of the regression coefficients and the asymptotic normality of the estimate of the baseline function. We conduct Monte Carlo simulations to examine the finite-sample performance of the proposed procedure and illustrate the proposed method with a real data example.

Keywords: Model selection consistency, partial faithfulness, semiparametric regression modeling

1. Introduction

Let y be a response variable, u be a univariate continuous covariate and x = (x1, …, xp) be a p-dimensional covariate vector. The partially linear model (PLM) assumes that

y = g(u) + x^⊤β + ϵ, (1)

where g is an unspecified baseline function, and β is a vector of unknown regression coefficients. The PLM thus assumes that the regression function linearly depends on the covariates x while depending nonparametrically on u. This model increases the flexibility of linear models by allowing the intercept to be a nonparametric function of the covariate u. It is one of the most popular semiparametric regression models in the literature [14].
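To fix ideas, the following sketch generates data from model (1); the baseline function, coefficients, and noise level are illustrative assumptions of ours, not values taken from the paper.

```python
# A minimal sketch of data generated from the PLM (1); g, beta, and the
# noise scale below are illustrative assumptions, not the paper's choices.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 10
u = rng.uniform(size=n)                       # univariate covariate
x = rng.normal(size=(n, p))                   # linear covariates
beta = np.zeros(p)
beta[[0, 1, 4]] = [3.0, 1.5, 2.0]             # sparse coefficient vector
g = lambda t: np.sin(2 * np.pi * t)           # unspecified baseline function
y = g(u) + x @ beta + rng.normal(scale=0.5, size=n)
```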

This work aims to develop a variable selection procedure for the PLM with ultrahigh dimensional x, i.e., p = O{exp(n^a)} for some positive constant a, where n is the sample size. PLM estimation has been well studied in the case where p is finite and fixed; see, e.g., [8], [15]. Variable selection procedures have also been developed in this case, e.g., by Fan and Li [6] via penalized least squares, and by Liang and Li [11], who employed the penalized least squares method for variable selection in the PLM in the presence of errors in variables. Xie and Huang [18] studied the penalized least squares method with the SCAD penalty [5] for variable selection in the PLM with p = o(n).

In this paper, we propose a new variable selection procedure for the PLM. This procedure differs from the aforementioned penalized least squares methods in that it is a partial correlation learning procedure based on the notion of partial faithfulness, which was first advocated by Bühlmann et al. [1] for normal linear models and further used for elliptical linear models in [10]. We first utilize partial residual techniques to eliminate the nonparametric baseline function, and then conduct variable selection by recursively testing the partial correlations between the partial residual of the response and those of the linear covariates. That is, we recursively compare the partial correlations with some threshold values, and therefore refer to this method as the thresholded partial correlation on partial residuals (TPC-PR). Thus, the TPC-PR can be carried out by using the algorithm proposed in [10].

This paper’s main purpose is to study the theoretical properties of the TPC-PR and to ensure that partial correlation learning works properly for the PLM. Developing the asymptotic theory of the TPC-PR is challenging since we have to deal with the approximation errors due to the nonparametric estimation involved in the partial residuals. Furthermore, we need to study partial faithfulness under the PLM setting without assuming normality. We first establish a concentration inequality for the partial correlations of the partial residuals. We then prove the model selection consistency of the TPC-PR under the PLM with ultrahigh dimensional x. We further establish the √n-consistency of the regression coefficient estimate and the asymptotic normality of the nonparametric baseline estimate.

The rest of the paper is organized as follows. In Section 2, we discuss partial faithfulness under the PLM setting, and systematically study the asymptotic theory of partial correlations between partial residuals. We propose the TPC-PR in Section 3, and carefully study its theoretical properties. Section 4 provides the results of Monte Carlo studies and a real data example. Technical proofs are given in Section 5, along with the corresponding regularity conditions to facilitate the proofs. A conclusion is provided in Section 6.

2. Partial faithfulness and partial correlations for the PLM

In model (1), assume that the random error satisfies E(ϵ|u) = E(ϵ|xj) = 0 for all j ∈ {1, …, p} and E(ϵ²) < ∞. The objective is to recover the truly active model A = {j ∈ {1, …, p} : βj ≠ 0} with cardinality |A|, as well as to estimate g(u) and the nonzero coefficients in β. As most variable selection procedures do, we impose here a sparsity assumption, namely that |A| = O(n^b), where b is defined in Theorems 1 and 2.

2.1. Partial faithfulness in partially linear models

We first discuss a partial faithfulness assumption for the PLM in (1) as a theoretical basis for our proposed variable selection procedure. This assumption, initially formulated by Bühlmann et al. [1], states that if the partial correlation between a given predictor and the response is zero given some subset of the other predictors, then the correlation between this predictor and the response is also zero given all other predictors. However, when the nonparametric baseline function is taken into consideration in model (1), this assumption is not directly applicable. Thus we first need to apply a partial residual technique to deal with the nonparametric part. Specifically, note that E(y|u) = g(u) + E(x|u)^⊤β + E(ϵ|u).

Model (1) can be written in the form

y* = x*^⊤β + ϵ*, (2)

where y* = y − E(y|u) and x* = (x1*, …, xp*)^⊤ = x − E(x|u) are called the partial residuals of y and x on u, and ϵ* = ϵ − E(ϵ|u). It is easy to show that when the covariance matrix of x*, denoted by Σx*, is positive definite, we have

β = Σx*^{−1} cov(x*, y*). (3)

Therefore, for any j ∈ {1, …, p},

βj = 0 ⇔ ρ(y*, xj* | xk*, k ∈ {j}^C) = 0, (4)

where ρ(z1, z2|z3) represents the partial correlation between z1 and z2 after adjusting both for z3, and {j}^C = {1, …, j − 1, j + 1, …, p}. This provides the rationale for recovering the nonzero coefficients in A by evaluating partial correlations.
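As a quick numeric check of (3), the sketch below assumes the conditional means E(x|u) and E(y|u) are known exactly — an assumption made only for illustration; in practice they must be estimated, as in Section 2.2 — and recovers β from the partial residuals.

```python
# Numeric check of identity (3): beta = Sigma_{x*}^{-1} cov(x*, y*).
# E(x|u) is taken to be a known function here, purely for illustration.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100_000, 5
u = rng.uniform(size=n)
m = np.cos(2 * np.pi * np.outer(u, np.arange(1, p + 1)))  # E(x|u), assumed known
x = m + rng.normal(size=(n, p))
beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0])
y = u**2 + x @ beta + rng.normal(size=n)                  # g(u) = u^2

x_star = x - m                        # partial residuals of x on u
y_star = y - (u**2 + m @ beta)        # E(y|u) = g(u) + E(x|u)' beta
Sigma = np.cov(x_star, rowvar=False)
c = np.array([np.cov(x_star[:, j], y_star)[0, 1] for j in range(p)])
print(np.linalg.solve(Sigma, c))      # approximately (3, 1.5, 0, 0, 2)
```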

However, the computation of ρ(y*, xj* | xk*, k ∈ {j}^C) is infeasible under the high-dimensional setting when p is large. To address this issue, we adapt the concept of partial faithfulness [1] to the PLM; it allows us to replace the evaluation of ρ(y*, xj* | xk*, k ∈ {j}^C) by recursive assessments of partial correlations with lower-dimensional conditioning sets, in a backward direction. The partial faithfulness of the PLM is defined as follows.

Definition 1. The partially linear model (1) is said to be partially faithful if for every j ∈ {1, …, p}, ρ(y*,xj*|xS*)=0 for some S{j}C implies that ρ(y*,xj*|xk*,k{j}C)=0, where xj* and y* are defined in model (2), and xS*={xj*:jS} for some index set S.

To fully understand the implications of the partial faithfulness, recall that (4) indicates that for every j ∈ {1, …, p} in model (2), βj = 0 is equivalent to {ρ(y*,xj*|xk*,k{j}C)=0}. Therefore,

βj = 0 ⇔ ρ(y*, xj* | xS*) = 0 for some S ⊆ {j}^C,

or equivalently,

βj ≠ 0 ⇔ ρ(y*, xj* | xS*) ≠ 0 for all S ⊆ {j}^C.

That is, under the partial faithfulness assumption, the predictor xj may be removed if there exists one subset xS* such that xj* is no longer needed once xS* is in the model. In particular, taking S = ∅ shows that an active predictor must also have a non-trivial marginal correlation with the response. Thus partial faithfulness rules out the situation where some predictors are marginally uncorrelated with the response but possess joint effects with other covariates. This coincides with the assumption underlying any sure screening procedure, initiated by [7].

Lemma 1 below provides sufficient conditions for partial faithfulness of PLM.

Lemma 1. Assume that

(A1) Σx* is positive definite for all u;

(A2) {βj : j ∈ A} ~ f(b)db, where f denotes the density, on a subset of ℝ^{|A|}, of a distribution that is absolutely continuous with respect to Lebesgue measure.

Then (x*,y*) satisfies partial faithfulness almost surely with respect to the distribution generating non-zero regression coefficients.

Conditions (A1) and (A2) are inspired by [1]. Condition (A1) guarantees the identifiability of β due to (3). Condition (A2) may be interpreted from a Bayesian point of view. We can treat nonzero βjs as independently and identically distributed random variables from a population with a non-trivial density. This condition is mild in the sense that from a Bayesian perspective, the zero coefficients can arise in an arbitrary fashion. We remark here that though already mild, (A1) and (A2) may not be the weakest conditions to guarantee partial faithfulness.

Based on Lemma 1, in order to identify the nonzero βj's, it suffices to test recursively the above partial correlations with index sets S of sequentially increasing cardinality |S|. Lemma 1 is a direct corollary of Theorem 1 in [1].

2.2. Asymptotics of sample partial correlations for PLM

The problem of comparing ρ(y*, xj* | xS*) with 0 becomes, in practice, testing the null hypothesis H0: ρ(y*, xj* | xS*) = 0. This requires studying the asymptotic behavior of the estimated partial correlations ρ^(y*, xj* | xS*), which are computed through several estimated conditional means. We first apply local linear regression [3] to estimate E(y|u) and E(x|u) in y* and x* based on the random sample (u1, x1, y1), …, (un, xn, yn). The smoothing matrix S(h) is computed as

S(h) = ( (1, 0){Z(u1)^⊤W(u1,h)Z(u1)}^{−1}Z(u1)^⊤W(u1,h)
         ⋮
         (1, 0){Z(un)^⊤W(un,h)Z(un)}^{−1}Z(un)^⊤W(un,h) ),

where

Z(u) = ( 1  u1 − u
         ⋮   ⋮
         1  un − u )  and  W(u,h) = diag{Kh(u1 − u), …, Kh(un − u)},

with Kh(·) = K(·/h)/h, where K is a kernel function and h a bandwidth. Then sample versions of the partial residuals y* and x* can be obtained as

y^* = {In − S(hy)}y,  X^* = [{In − S(h1)}X1, …, {In − S(hp)}Xp], (5)

where (X1, …, Xp) = X = (x1, …, xn)^⊤ and y = (y1, …, yn)^⊤, In is the n × n identity matrix, and hy and h1, …, hp are the bandwidths for estimating E(y|u) and E(x1|u), …, E(xp|u), respectively. The marginal correlations between the partial residuals y* and x1*, …, xp* are then estimated by the Pearson correlation between y^* and each column of X^*, viz.

ρ^(y*, xj*) = ⟨{In − S(hy)}y, {In − S(hj)}Xj⟩ / [‖{In − S(hy)}y‖ ‖{In − S(hj)}Xj‖].
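A hedged sketch of this construction follows: it builds the local linear smoothing matrix S(h) with a Gaussian kernel and forms the sample partial residuals in (5). For simplicity the sketch applies a single bandwidth to all columns, whereas the paper selects hy, h1, …, hp separately; the bandwidth value in the usage comment is an ad hoc assumption rather than the plug-in choice used later.

```python
# A sketch of the local linear smoother S(h) and the partial residuals (5).
# The Gaussian kernel and the single shared bandwidth are simplifying
# assumptions for illustration.
import numpy as np

def local_linear_smoother(u, h):
    """Row i of S(h) holds the local linear fitting weights at u_i."""
    n = len(u)
    S = np.empty((n, n))
    for i in range(n):
        z = np.column_stack([np.ones(n), u - u[i]])       # Z(u_i)
        w = np.exp(-0.5 * ((u - u[i]) / h) ** 2) / h      # K_h(u_j - u_i)
        zw = z.T * w                                      # Z(u_i)' W(u_i, h)
        S[i] = np.linalg.solve(zw @ z, zw)[0]             # (1, 0) {Z'WZ}^{-1} Z'W
    return S

def partial_residuals(u, y, X, h):
    """Return {I - S(h)} y and {I - S(h)} X_j for every column j."""
    S = local_linear_smoother(u, h)
    return y - S @ y, X - S @ X

# usage: ys, Xs = partial_residuals(u, y, x, h=0.1); the marginal correlation
# above is ys @ Xs[:, j] / (np.linalg.norm(ys) * np.linalg.norm(Xs[:, j])).
```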

Following [1], the partial correlations can be computed recursively, for any kS, by

ρ^(y*, xj* | xS*) = {ρ^(y*, xj* | xS\{k}*) − ρ^(y*, xk* | xS\{k}*)ρ^(xj*, xk* | xS\{k}*)} / [{1 − ρ^2(y*, xk* | xS\{k}*)}{1 − ρ^2(xj*, xk* | xS\{k}*)}]^{1/2}. (6)
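The recursion (6) can be implemented directly from the correlation matrix of the partial residuals; a sketch follows. It peels off one conditioning index at a time, exactly as in the display, though a regression-based computation would be cheaper for large conditioning sets.

```python
# A direct implementation of recursion (6); variable 0 plays the role of y*
# and variables 1..p the roles of x_1*, ..., x_p* in the correlation matrix R.
import numpy as np

def pcor(R, a, b, S):
    """Partial correlation of variables a and b given the index tuple S."""
    if not S:
        return R[a, b]
    k, rest = S[0], S[1:]
    r_ab = pcor(R, a, b, rest)
    r_ak = pcor(R, a, k, rest)
    r_bk = pcor(R, b, k, rest)
    return (r_ab - r_ak * r_bk) / np.sqrt((1 - r_ak**2) * (1 - r_bk**2))
```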

Next, we discuss the asymptotic normality of the partial correlations ρ^(y*,xj*|xS*) under a partially linear model setting.

Lemma 2. For any j ∈ {1, …, p} and S{j}C, under regularity conditions (B1)–(B8) in Appendix A, we have

√n{ρ^(y*, xj* | xS*) − ρ(y*, xj* | xS*)} ⇝ N[0, (1 + κ){1 − ρ²(y*, xj* | xS*)}²],

where κ is the marginal kurtosis of the distribution of (u, x, y), and ⇝ denotes convergence in distribution.

3. Variable selection via partial correlations of partial residuals

Lemma 1 states that identifying the nonzero coefficients is equivalent to recursively testing H0: ρ(y*, xj* | xS*) = 0 for different S. Lemma 2 further suggests conducting a Z-test based on the asymptotically normal distribution of ρ^(y*, xj* | xS*). We claim βj = 0 and delete xj if at least one S ⊆ {j}^C can be found such that H0 cannot be rejected for this S.

Specifically, we first set S = ∅: by Lemma 1, we can delete xj if H0 is not rejected for ρ(y*, xj*) with the empty conditioning set, and obtain the first-step active set A^[1]. Note that only marginal utilities are involved in this step, hence the procedure for obtaining A^[1] can be viewed as a feature screening technique for the PLM. Among the candidate covariates xj in A^[1], we continue to assess the partial correlations given each individual xk, k ∈ A^[1]. The insignificant xj's are further deleted, and the second-step active set A^[2] ⊆ A^[1] is obtained. Then the partial correlations given two and more covariates in the current active set are evaluated in a sequential fashion. The procedure naturally stops when the cardinality of the conditioning covariate set exceeds that of the current active set. Any model fitting technique for the PLM in the literature can then be applied to estimate the nonzero coefficients of the linear term as well as the nonparametric baseline function. For the sake of simplicity, the least squares estimates β^j are computed for the nonzero coefficients, and the nonparametric function is estimated by g^(u) = S(h)(y − Xβ^) with the plug-in β^. We summarize the whole procedure in Algorithm 1, in which we follow [10] to set

T(α, n, κ, |S|) = [exp{2√(1 + κ)Φ^{−1}(1 − α/2)/√(n − |S| − 1)} − 1] / [exp{2√(1 + κ)Φ^{−1}(1 − α/2)/√(n − |S| − 1)} + 1]

and

κ^ = (1/p) Σ_{j=1}^p [{n^{−1} Σ_{i=1}^n (xij − x¯j)⁴} / {3{n^{−1} Σ_{i=1}^n (xij − x¯j)²}²} − 1],

where Φ^{−1} is the inverse of the cumulative distribution function of N(0, 1), x¯j is the sample mean of the jth element of x, and xij is the jth element of xi.
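The threshold and the kurtosis estimate displayed above translate into code as follows; this is a sketch under the stated formulas, with scipy's normal quantile standing in for Φ^{−1}.

```python
# Sketches of T(alpha, n, kappa, |S|) and kappa_hat from the display above.
import numpy as np
from scipy.stats import norm

def threshold(alpha, n, kappa, s):
    z = 2 * np.sqrt(1 + kappa) * norm.ppf(1 - alpha / 2) / np.sqrt(n - s - 1)
    return (np.exp(z) - 1) / (np.exp(z) + 1)    # equals tanh(z / 2)

def kappa_hat(X):
    Xc = X - X.mean(axis=0)
    m2 = (Xc**2).mean(axis=0)                   # per-column second moments
    m4 = (Xc**4).mean(axis=0)                   # per-column fourth moments
    return np.mean(m4 / (3 * m2**2) - 1)
```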

We remark here that the bandwidths h used in the smoothing matrices S(h) are chosen separately as hy and h1, …, hp for estimating the conditional means of y and of each xj. Several bandwidth selection techniques can be adopted; we use the plug-in method of [13] rather than cross-validation to save computational cost. In addition, the bandwidth needs to be selected again when the nonparametric baseline function is refitted after the active predictors have been detected.

Next we discuss the theoretical properties of the proposed TPC-PR procedure. We first establish its model selection consistency in the following theorem under the ultrahigh dimensional PLM setting.

Algorithm 1: TPC-PR procedure for the PLM

1. Compute the sample partial residuals y^* and X^* by (5).
2. Set m = 1 and construct the first-step estimated active set by evaluating the marginal correlations between partial residuals: A^[1] = {j ∈ {1, …, p} : |ρ^(y*, xj*)| > T(α, n, κ^, 0)}.
3. Let m = m + 1. Establish the mth-step estimated active set as A^[m] = {j ∈ A^[m−1] : |ρ^(y*, xj* | xS*)| > T(α, n, κ^, m − 1) for all S ⊆ A^[m−1]\{j} with |S| = m − 1}.
4. Repeat Step 3 until |A^[m]| ≤ m.
5. The estimated coefficient vector β^ = (β^1, …, β^p)^⊤ is defined as follows: β^j = 0 if j ∉ A^[m]; for j ∈ A^[m], β^j is the least squares estimate obtained by regressing the partial residuals.
6. Obtain the estimated nonparametric baseline function by g^(u) = S(h)(y − Xβ^).
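A compact sketch of Algorithm 1, assembled from the hypothetical helpers sketched earlier (partial_residuals, pcor, threshold, kappa_hat), is given below; it illustrates the control flow and is not the authors' implementation.

```python
# A sketch of Algorithm 1 (TPC-PR). Steps are numbered as in the algorithm.
import itertools
import numpy as np

def tpc_pr(u, y, X, alpha=0.05, h=0.1):
    ys, Xs = partial_residuals(u, y, X, h)                       # Step 1
    R = np.corrcoef(np.column_stack([ys, Xs]), rowvar=False)     # y* is index 0
    n, p = X.shape
    kap = kappa_hat(X)
    active = [j for j in range(p)                                # Step 2
              if abs(R[0, j + 1]) > threshold(alpha, n, kap, 0)]
    m = 1                                                        # here m = |S|
    while len(active) > m:                                       # Steps 3-4
        keep = []
        for j in active:
            others = [k for k in active if k != j]
            if all(abs(pcor(R, 0, j + 1, tuple(k + 1 for k in S))) >
                   threshold(alpha, n, kap, m)
                   for S in itertools.combinations(others, m)):
                keep.append(j)
        active, m = keep, m + 1
    beta = np.zeros(p)                                           # Step 5
    if active:
        beta[active] = np.linalg.lstsq(Xs[:, active], ys, rcond=None)[0]
    return active, beta          # Step 6 then refits g with the plug-in beta
```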

Theorem 1. Assume regularity conditions (B1)–(B8) in Appendix A for the partially linear model (1). Further assume the partial faithfulness from Definition 1. Then there exist a sequence α → 0 and constants a and b with 0 < a + b < (1 − 2d)/5 and d ∈ (0, 1/2) such that, for p = O{exp(n^a)} and |A| = O(n^b), we have Pr(A^[m] ≠ A) → 0 as n → ∞.

Theorem 1 implies that the selected model successfully captures the true one with probability tending to 1. In the proof of this theorem, we in fact show that the probability of the selected model being the true one tends to 1 at an exponential rate. The result is more challenging to establish than that of [10], since the approximation error of the partial residuals has to be taken into consideration. Theorem 2 below states the √n-consistency of the estimated linear coefficients, as well as the asymptotic normality of the estimated nonparametric baseline function.

Theorem 2. Under the same conditions as in Theorem 1, for p = O{exp(n^a)} and |A| = O(n^b), where a and b are defined as in Theorem 1, we have ‖β^ − β‖ = Op(n^{−1/2}), where ‖·‖ refers to the L2 norm. Furthermore,

√(nh){g^(u) − g(u) − g′′(u)μ2h²/2} ⇝ N[0, σ²ν0/f(u)],

as n → ∞, where μj = ∫u^j K(u)du, ν0 = ∫K²(u)du, and h is the bandwidth used in the smoothing matrix for estimating the nonparametric function in the last step of the algorithm.

From Theorem 2, we further derive the asymptotic bias and variance of g^(u) at any given value of u, viz.

bias{g^(u)} = g′′(u)μ2h²/2 + o(h²) + Op(n^{−1/2}),  var{g^(u)} = σ²ν0/{nhf(u)}{1 + o(1)}.
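Under these expansions, a rough pointwise confidence interval for g(u) can be formed from the asymptotic variance; the sketch below ignores the O(h²) bias term (as is common when the bandwidth is undersmoothed) and assumes estimates of f(u), σ², and ν0 are available.

```python
# A rough pointwise 95% interval for g(u0) from the asymptotic variance above;
# f_hat_u0, sigma2_hat, and nu0 are assumed to be available estimates.
import numpy as np

def g_interval(g_hat_u0, f_hat_u0, sigma2_hat, nu0, n, h):
    se = np.sqrt(sigma2_hat * nu0 / (n * h * f_hat_u0))  # sqrt of var{g_hat(u)}
    return g_hat_u0 - 1.96 * se, g_hat_u0 + 1.96 * se    # bias term ignored
```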

Theorems 1 and 2 ensure the theoretical validity of using the selected TPC-PR model for subsequent inference.

4. Numerical studies

In this section, we conduct simulation studies to assess the finite-sample performance of TPC-PR and to empirically verify the theoretical properties stated in the last section. We then illustrate the proposed methodology on a real data example.

4.1. Simulation studies

We evaluate the performance of TPC-PR by comparing it to penalized regression on the partial residuals, with the SCAD penalty [5] and the LASSO penalty [16], respectively. That is, we transform the PLM into a linear model via the partial residual technique, followed by penalized least squares estimation. The nonparametric baseline is estimated in the same fashion as for TPC-PR. Furthermore, the PC-simple algorithm proposed by [1] is also applied to the partial residuals, and is denoted PC-PR. The distinction between TPC-PR and PC-PR is that TPC-PR takes the kurtosis into consideration, so the normality assumption is not necessary for conducting the algorithm; PC-PR, in contrast, relies heavily on the normality of the error term when sequentially testing the partial correlations.

To further enhance the finite-sample performance of TPC-PR, in practice we may fine-tune the critical value. Specifically, we use cT(α, n, κ^, m) as the threshold, where c is a tuning parameter chosen by minimizing the extended Bayesian information criterion [2],

ln(σ^2)+df×ln(p)×ln(n)/n,

where σ^2 is the estimated error variance of the PLM and df is the number of nonzero estimated coefficients. The modified TPC-PR is denoted by TPC-PR-EBIC.
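A sketch of this tuning step follows; it assumes a hypothetical variant of the earlier tpc_pr sketch that accepts a multiplier c on the threshold, and the grid of c values is an arbitrary choice for illustration.

```python
# EBIC fine tuning of the threshold constant c; tpc_pr(..., c=...) is a
# hypothetical extension of the earlier sketch that multiplies T by c.
import numpy as np

def tune_c(u, y, X, grid=(0.6, 0.8, 1.0, 1.2, 1.4), alpha=0.05, h=0.1):
    n, p = X.shape
    best_c, best_ebic = None, np.inf
    ys, Xs = partial_residuals(u, y, X, h)
    for c in grid:
        active, beta = tpc_pr(u, y, X, alpha=alpha, h=h, c=c)
        sigma2 = np.mean((ys - Xs @ beta) ** 2)   # estimated error variance
        ebic = np.log(sigma2) + len(active) * np.log(p) * np.log(n) / n
        if ebic < best_ebic:
            best_c, best_ebic = c, ebic
    return best_c
```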

We conduct the simulation study under three dimension settings: low dimension (p = 20), medium dimension (p = 200), and high dimension (p = 500), with sample size n = 200. For each setting, we consider two distributions for the covariate vector x: a normal distribution and a mixture normal distribution. The normal samples with autoregressive correlation are generated as follows. We first draw (x1, …, x_{p+1}) from a multivariate normal distribution N(0, Σ), where Σ is the covariance matrix with correlation ρ^{|i−j|} between xi and xj and ρ = 0.5 or 0.8, and the error variance σ² is taken to be 0.25 and 1 to represent strong and weak signal, respectively. We then let u = Φ(x_{p+1}), where Φ is the cumulative distribution function of the standard normal distribution N(0, 1); in this way u is correlated with x and follows a uniform distribution. Following the same routine, we draw random samples from a mixture of two normal distributions, viz. 0.9N(0, Σ) + 0.1N(0, 9Σ). The true coefficient vector is β = (3, 1.5, 0, 0, 2, 0, …, 0)^⊤, so that A = {1, 2, 5}. Finally, we take the nonparametric baseline function to be g(u) = u² or sin(2πu) in two scenarios. The experiment is repeated 1000 times; a hedged sketch of this data-generating scheme is given below, after which we list the criteria adopted to assess the performance of all procedures.
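Note that in the mixture case the construction u = Φ(x_{p+1}) makes u only approximately uniform, a detail this sketch does not attempt to correct.

```python
# A sketch of the simulation design: AR-correlated (possibly mixture) normal
# covariates, u = Phi(x_{p+1}), and error variance sigma2.
import numpy as np
from scipy.stats import norm

def simulate(n=200, p=500, rho=0.5, sigma2=1.0, mixture=True, seed=0,
             g=lambda t: np.sin(2 * np.pi * t)):
    rng = np.random.default_rng(seed)
    idx = np.arange(p + 1)
    L = np.linalg.cholesky(rho ** np.abs(idx[:, None] - idx[None, :]))
    z = rng.normal(size=(n, p + 1)) @ L.T        # N(0, Sigma) rows
    if mixture:                                  # 0.9 N(0, Sigma) + 0.1 N(0, 9 Sigma)
        scale = np.where(rng.uniform(size=n) < 0.1, 3.0, 1.0)
        z *= scale[:, None]
    X, u = z[:, :p], norm.cdf(z[:, p])
    beta = np.zeros(p)
    beta[[0, 1, 4]] = [3.0, 1.5, 2.0]            # A = {1, 2, 5}
    y = g(u) + X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    return u, X, y, beta
```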

  1. For evaluating model selection consistency:
    • False positive number (FP): the number of zero coefficients erroneously detected to be nonzero, i.e., FP = Σ_{j=1}^p 1(β^j ≠ 0, βj = 0).
    • True positive number (TP): the number of nonzero coefficients correctly detected to be nonzero, i.e., TP = Σ_{j=1}^p 1(β^j ≠ 0, βj ≠ 0).
    • Under-fit percentage (Under-fit): the proportion of replications missing at least one of the truly active covariates in the linear part.
    • Correctly-fit percentage (Cor-fit): the proportion of replications identifying exactly the truly active set.
    • Over-fit percentage (Over-fit): the proportion of replications identifying all the truly active covariates but erroneously including at least one inactive covariate.
  2. For evaluating the √n-consistency of the linear coefficients:
    • Model error (ME) due to the linear part: ME = E[{x^⊤(β^ − β)}²] = (β^ − β)^⊤cov(x)(β^ − β).
  3. For evaluating the performance of the estimated nonparametric baseline:
    • Square root of average squared errors (RASE), defined by RASE = [Ng^{−1} Σ_{k=1}^{Ng} {g^(νk) − g(νk)}²]^{1/2}, where ν1, …, ν_{Ng} are the grid points at which the functions are evaluated and Ng is the number of grid points.
The medians of ME and RASE, along with their median absolute deviations (Devi), over the 1000 replications are recorded. For the other criteria, we report the average over the 1000 replications.

We present the simulation results in Table 1 for the high-dimensional case (p = 500) with the mixture normal distribution and correlation ρ = 0.5, and in Table 2 for the corresponding case with ρ = 0.8. The remaining results are provided in the Online Supplement.

Table 1:

Simulation results for mixture of normals when p = 500 and ρ = 0.5.

Method ME(Devi) TP FP Under-fit Cor-fit Over-fit RASE(Devi)
σ² = 0.25, g(u) = u²

SCAD 0.0074 (0.0042) 3.000 1.245 0.000 0.640 0.360 0.0917 (0.0247)
LASSO 0.0446 (0.0120) 3.000 32.170 0.000 0.000 1.000 0.1027 (0.0267)
PC-PR 0.0096 (0.0051) 3.000 0.260 0.000 0.760 0.240 0.0893 (0.0255)
TPC-PR 0.0066 (0.0040) 3.000 0.005 0.000 0.995 0.005 0.0895 (0.0250)
TPC-PR-EBIC 0.0066 (0.0040) 3.000 0.000 0.000 1.000 0.000 0.0895 (0.0251)

σ² = 0.25, g(u) = sin(2πu)

SCAD 0.0072 (0.0040) 3.000 1.215 0.000 0.650 0.350 0.1117 (0.0204)
LASSO 0.0649 (0.0213) 3.000 28.750 0.000 0.005 0.995 0.1541 (0.0322)
PC-PR 0.0142 (0.0100) 3.000 0.220 0.000 0.785 0.215 0.1351 (0.0279)
TPC-PR 0.0117 (0.0076) 2.990 0.015 0.010 0.990 0.000 0.1340 (0.0281)
TPC-PR-EBIC 0.0114 (0.0072) 3.000 0.000 0.000 1.000 0.000 0.1340 (0.0278)

σ² = 1, g(u) = u²

SCAD 0.0342 (0.0182) 3.000 4.965 0.000 0.535 0.465 0.1803 (0.0525)
LASSO 0.1736 (0.0430) 3.000 42.945 0.000 0.000 1.000 0.1913 (0.0488)
PC-PR 0.0605 (0.0274) 3.000 0.935 0.000 0.290 0.710 0.1763 (0.0513)
TPC-PR 0.0248 (0.0134) 3.000 0.040 0.000 0.960 0.040 0.1737 (0.0513)
TPC-PR-EBIC 0.0241 (0.0134) 3.000 0.005 0.000 0.995 0.005 0.1737 (0.0508)

σ² = 1, g(u) = sin(2πu)

SCAD 0.0329 (0.0170) 3.000 4.680 0.000 0.555 0.445 0.2010 (0.0395)
LASSO 0.1748 (0.0445) 3.000 41.800 0.000 0.000 1.000 0.2172 (0.0408)
PC-PR 0.0614 (0.0277) 3.000 0.920 0.000 0.295 0.705 0.1991 (0.0402)
TPC-PR 0.0257 (0.0140) 3.000 0.040 0.000 0.960 0.040 0.1968 (0.0402)
TPC-PR-EBIC 0.0250 (0.0136) 3.000 0.005 0.000 0.995 0.005 0.1947 (0.0419)

Table 2:

Simulation results for mixture of normals when p = 500 and ρ = 0.8.

Method ME(Devi) TP FP Under-fit Cor-fit Over-fit RASE(Devi)
σ² = 0.25, g(u) = u²

SCAD 0.0100 (0.0057) 3.000 1.185 0.000 0.640 0.360 0.1121 (0.0316)
LASSO 0.0658 (0.0204) 3.000 29.150 0.000 0.000 1.000 0.1915 (0.0571)
PC-PR 0.0121 (0.0067) 3.000 0.195 0.000 0.815 0.185 0.1089 (0.0345)
TPC-PR 0.0107 (0.0064) 2.995 0.015 0.005 0.985 0.010 0.1089 (0.0332)
TPC-PR-EBIC 0.0101 (0.0060) 3.000 0.000 0.000 1.000 0.000 0.1084 (0.0328)

σ² = 0.25, g(u) = sin(2πu)

SCAD 0.0102 (0.0060) 3.000 1.120 0.000 0.660 0.340 0.1272 (0.0293)
LASSO 0.0902 (0.0300) 3.000 32.095 0.000 0.000 1.000 0.2326 (0.0555)
PC-PR 0.0150 (0.0099) 2.975 0.215 0.025 0.800 0.175 0.1527 (0.0313)
TPC-PR 0.0122 (0.0084) 2.950 0.065 0.050 0.935 0.015 0.1529 (0.0318)
TPC-PR-EBIC 0.0117 (0.0078) 2.995 0.020 0.005 0.980 0.015 0.1512 (0.0304)

σ² = 1, g(u) = u²

SCAD 0.0459 (0.0232) 3.000 4.160 0.000 0.515 0.485 0.2121 (0.0557)
LASSO 0.1736 (0.0430) 3.000 42.945 0.000 0.000 1.000 0.1913 (0.0488)
PC-PR 0.0605 (0.0274) 3.000 0.935 0.000 0.290 0.710 0.1763 (0.0513)
TPC-PR 0.0248 (0.0134) 3.000 0.040 0.000 0.960 0.040 0.1737 (0.0513)
TPC-PR-EBIC 0.0241 (0.0134) 3.000 0.005 0.000 0.995 0.005 0.1737 (0.0508)

σ² = 1, g(u) = sin(2πu)

SCAD 0.0456 (0.0235) 3.000 4.120 0.000 0.540 0.460 0.2277 (0.0504)
LASSO 0.2464 (0.0676) 3.000 37.635 0.000 0.000 1.000 0.3521 (0.0853)
PC-PR 0.0595 (0.0307) 3.000 0.730 0.000 0.430 0.570 0.2317 (0.0504)
TPC-PR 0.0343 (0.0193) 2.970 0.055 0.030 0.945 0.020 0.2296 (0.0550)
TPC-PR-EBIC 0.0327 (0.0176) 3.000 0.000 0.000 1.000 0.000 0.2290 (0.0544)

From Tables 1 and 2, the methods developed in this paper (TPC-PR and TPC-PR-EBIC) successfully identify the three truly active covariates (TP ≈ 3), with fairly low average numbers of falsely included inactive covariates (FP ≈ 0). Likewise, the correct-fit rates approach 1, illustrating the model selection consistency, and the under-fit and over-fit rates are both close to 0. The model error (ME) and the square root of average squared errors (RASE) are small enough to support the √n-consistency of the coefficient estimates and the validity of the nonparametric baseline estimation.

In terms of comparison, the selection methods based on partial correlations favor sparser models than the penalized regression approaches in general. According to Tables 1 and 2, the last three methods consistently outperform the first two penalized regression approaches, especially LASSO, which, as expected, overfits the data and yields overly large models. Although the correct-fit probability for SCAD (around 0.6) is much larger than that of LASSO, it still lags behind the other three methods. The ME and RASE exhibit the same pattern, whichever functional form of the baseline g(u) is assumed.

Among the three partial-correlation-based methods, PC-PR yields the worst results, especially when the noise-to-signal ratio is high (σ² = 1): PC-PR identifies the true model in only about 29% of the experiments, while TPC-PR and TPC-PR-EBIC both exceed 95%. This is because PC-PR uses the wrong variance estimate for the test statistics when the normality assumption fails, whereas TPC-PR incorporates the kurtosis into the limiting distribution of the partial correlations and hence corrects the variance estimate. Under the normal distribution setting (the corresponding results are provided in the Online Supplement), PC-PR and TPC-PR perform similarly, since the asymptotic distributions of the two test statistics are identical under normality. TPC-PR with EBIC fine tuning indeed enhances the finite-sample performance, although TPC-PR already behaves satisfactorily under most circumstances and is sufficient for practical application. The same pattern is observed for both noise levels and both values of the correlation ρ.

Comparing Tables 1 and 2, we observe that as ρ increases from 0.5 to 0.8, it becomes slightly more challenging to identify the true covariates: with highly correlated covariates, the model tends to be overfitted. In the high-correlation scenario, the improvement from TPC-PR-EBIC is more pronounced. Finally, when the error variance σ² increases from 0.25 to 1, PC-PR deteriorates dramatically, while the other methods remain relatively robust. We have also compared the computing times of these methods. In general, the LASSO uses the least computing time: for a simulated data set with p = 500, TPC-PR takes about 20 seconds, while the LASSO takes 0.1 second and the one-step SCAD takes about 3 seconds. TPC-PR is thus slower than the LASSO and the one-step SCAD due to the recursive nature of Algorithm 1, but it can still be carried out within a reasonable amount of time.

4.2. Supermarket data analysis

In this section, we apply the TPC-PR method to analyze a high dimensional data set from a supermarket. The data set consists of 464 daily records of the number of customers entering the supermarket, together with the sales volumes of 6398 products. We suspect a nonlinear relation between the date and the popularity of the store, so a partially linear model is a plausible choice for fitting the data.

We apply SCAD, LASSO, PC-PR, TPC-PR, and TPC-PR-EBIC as in the simulation studies. The model sizes and the prediction errors are reported in Table 3. SCAD and LASSO still yield much larger models, with 28 and 39 selected variables respectively, than the partial-correlation-based methods, while their corresponding prediction errors are higher. Compared with PC-PR and TPC-PR, the TPC-PR-EBIC model is even sparser, with a slight sacrifice in prediction error. The time effect on the number of customers is depicted in Figure 1; some periodic pattern is observed.

Table 3:

Model sizes and prediction errors for the market data analysis.

Approach Model Size Prediction Error
SCAD 28 16.87
LASSO 39 17.53
PC-PR 10 16.36
TPC-PR 10 16.36
TPC-PR-EBIC 7 16.97

Figure 1: The estimated curve of the number of customers against dates.

5. Conditions and proofs

5.1. Regularity conditions and lemmas

The following regularity conditions are imposed to facilitate the proofs.

(B1) For j ∈ {1, …, p}, the conditional expectations E(y|u), E(xj|u), E(yxj|u), E(y2|u), and E(xj2|u) are all uniformly bounded in U, where U is the bounded support of u. Furthermore, we assume there exists δ1 > 0 such that (i) E{var(y|u)} ≥ δ1 and (ii) E{var(xj|u)} ≥ δ1.

(B2) xj and y satisfy the sub-exponential tail probability uniformly in u. That is, there exists s0 > 0 such that for s ∈ (0, s0),

sup_{u∈U} max_{j∈{1,…,p}} E{exp(s xj²)|u} < ∞,  sup_{u∈U} max_{j∈{1,…,p}} E{exp(s xj y)|u} < ∞,  sup_{u∈U} E{exp(s y²)|u} < ∞.

(B3) The partial correlations ρ(y*,xj*|xS*) satisfy

inf{|ρ(y*, xj* | xS*)| : j ∈ {1, …, p}, S ⊆ {j}^C, |S| ≤ |A|, ρ(y*, xj* | xS*) ≠ 0} ≥ cn,

where cn = O(n^{−d}) and d ∈ (0, 1/2).

(B4) The partial correlations ρ(y*,xj*|xS*) and ρ(xj*,xk*|xS*) satisfy

  1. sup{|ρ(y*, xj* | xS*)| : j ∈ {1, …, p}, S ⊆ {j}^C, |S| ≤ |A|} ≤ τ < 1;

  2. sup{|ρ(xi*, xj* | xS*)| : i, j ∈ {1, …, p}, i ≠ j, S ⊆ {i, j}^C, |S| ≤ |A|} ≤ τ < 1.

(B5) Let f(u) be the density function of u ∈ U. Assume f is bounded away from 0, and that f, its first derivative f′ and second derivative f′′ are uniformly bounded on U. That is, there exist δ3 > 0 and M1 > 0 such that

inf_{u∈U} f(u) ≥ δ3,  sup_{u∈U} f(u) ≤ M1,  sup_{u∈U} |f′(u)| ≤ M1,  sup_{u∈U} |f′′(u)| ≤ M1.

(B6) The kernel function K is a symmetric density function, and it is bounded and has finite second moment, i.e.,

sup_u |K(u)| < ∞  and  ∫ u²K(u) du < ∞.

(B7) (v, x, y) has an elliptical distribution EC_{p+2}(µ, Σ, ϕ), and there exists a function ψ such that the σ-field generated by u = ψ(v) is the same as that generated by v.

(B8) The bandwidths satisfy hy → 0, hj → 0, nhy³ → ∞, and nhj³ → ∞ for all j ∈ {1, …, p}.

We further state some lemmas to facilitate the technical proofs of the theorems. The proofs of Lemmas 3–5 below follow the same techniques as in [12].

Lemma 3. We adopt the following notation for simplicity:

Z1(u,h) = (1/n) Σ_{i=1}^n Kh(ui − u),  Z2(u,h) = (1/n) Σ_{i=1}^n {(ui − u)/h}Kh(ui − u),  Z3(u,h) = (1/n) Σ_{i=1}^n {(ui − u)/h}²Kh(ui − u),
Z4(u,h) = (1/n) Σ_{i=1}^n xij Kh(ui − u),  Z5(u,h) = (1/n) Σ_{i=1}^n xij{(ui − u)/h}Kh(ui − u),
Z6(u,h) = (1/n) Σ_{i=1}^n yi Kh(ui − u),  Z7(u,h) = (1/n) Σ_{i=1}^n yi{(ui − u)/h}Kh(ui − u).

Then under conditions (B1), (B3)–(B6) and (B8), for some small s > 0, and any ϵ > 0, we have the following results:

  1. sup_{u∈U} Pr{|Z1(u,h) − f(u)| > ϵ} ≤ 4(1 − sϵ/4)^n,

  2. sup_{u∈U} Pr{|Z2(u,h) − f(u)∫tK(t)dt| > ϵ} ≤ 4(1 − sϵ/4)^n,

  3. sup_{u∈U} Pr{|Z3(u,h) − f(u)∫t²K(t)dt| > ϵ} ≤ 4(1 − sϵ/4)^n,

  4. sup_{u∈U} Pr{|Z4(u,h) − f(u)E(xj|u)| > ϵ} ≤ 4(1 − sϵ/4)^n,

  5. sup_{u∈U} Pr{|Z5(u,h) − f(u)E(xj|u)∫tK(t)dt| > ϵ} ≤ 4(1 − sϵ/4)^n,

  6. sup_{u∈U} Pr{|Z6(u,h) − f(u)E(y|u)| > ϵ} ≤ 4(1 − sϵ/4)^n,

  7. sup_{u∈U} Pr{|Z7(u,h) − f(u)E(y|u)∫tK(t)dt| > ϵ} ≤ 4(1 − sϵ/4)^n.

Lemma 4. Assume A(u) and B(u) are two uniformly bounded functions of u. That is, there exist M4 and M5 such that

sup_{u∈U} |A(u)| ≤ M4,  sup_{u∈U} |B(u)| ≤ M5.

For any given u, A^(u) and B^(u) are estimates of A(u) and B(u) based on n samples. Suppose there exist C1, …, C4 and q > 0 such that

sup_{u∈U} Pr{|A^(u) − A(u)| > ϵ} ≤ C1n{(1 − C2ϵ²/n^{4q})^n + exp(−C3n^q)},
sup_{u∈U} Pr{|B^(u) − B(u)| > ϵ} ≤ C1n{(1 − C2ϵ²/n^{4q})^n + exp(−C3n^q)}.

Then

sup_{u∈U} Pr{|A^(u)B^(u) − A(u)B(u)| > ϵ} ≤ C1n{(1 − C2ϵ²/n^{4q})^n + exp(−C3n^q)}.

Furthermore, if inf_{u∈U} |B(u)| ≥ δ3 > 0, then

sup_{u∈U} Pr{|A^(u)/B^(u) − A(u)/B(u)| > ϵ} ≤ C1n{(1 − C2ϵ²/n^{4q})^n + exp(−C3n^q)},
sup_{u∈U} Pr{|{B^(u)}^{1/2} − {B(u)}^{1/2}| > ϵ} ≤ C1n{(1 − C2ϵ²/n^{4q})^n + exp(−C3n^q)}.

Lemma 5. Define

Wj(u,h) = {Z3(u,h)Z4(u,h) − Z2(u,h)Z5(u,h)} / {Z1(u,h)Z3(u,h) − Z2(u,h)²},
V(u,h) = {Z3(u,h)Z6(u,h) − Z2(u,h)Z7(u,h)} / {Z1(u,h)Z3(u,h) − Z2(u,h)²}.

Then under the same conditions as Lemma 3, we have

  1. sup_{u∈U} Pr{|Wj(u,h) − E(xj|u)| > ϵ} ≤ 4(1 − sϵ/4)^n.

  2. sup_{u∈U} Pr{|V(u,h) − E(y|u)| > ϵ} ≤ 4(1 − sϵ/4)^n.

5.2. Proof of Theorem 1

We divide the proof into six steps.

Step 1: First note that

ρ^(y*, xj* | xk*) = {ρ^(y*, xj*) − ρ^(y*, xk*)ρ^(xj*, xk*)} / [{1 − ρ^2(y*, xk*)}{1 − ρ^2(xj*, xk*)}]^{1/2}

is a function of ρ^(y*, xj*), ρ^(y*, xk*), and ρ^(xj*, xk*). Let g(x, y, z) = (x − yz)/√{(1 − y²)(1 − z²)} for x, y, z ∈ (−1, 1); then all of its first and second partial derivatives are bounded, provided y and z are bounded away from ±1.

By Theorem 1 in [10],

Pr{|ρ^(y*, xj* | xk*) − ρ(y*, xj* | xk*)| > ϵ}
= Pr[|g{ρ^(y*, xj*), ρ^(y*, xk*), ρ^(xj*, xk*)} − g{ρ(y*, xj*), ρ(y*, xk*), ρ(xj*, xk*)}| > ϵ]
≤ Pr{‖(ρ^(y*, xj*), ρ^(y*, xk*), ρ^(xj*, xk*))^⊤ − (ρ(y*, xj*), ρ(y*, xk*), ρ(xj*, xk*))^⊤‖2 > Cϵ}
≤ Pr{|ρ^(y*, xj*) − ρ(y*, xj*)| > Cϵ/3} + Pr{|ρ^(y*, xk*) − ρ(y*, xk*)| > Cϵ/3} + Pr{|ρ^(xj*, xk*) − ρ(xj*, xk*)| > Cϵ/3}
≤ 3C1n{(1 − C2ϵ²/n^{4q})^n + exp(−C3n^q)}.

Similarly, for any S{j}C, and |S||A|, we can have

Pr{|ρ^(y*, xj* | xS*) − ρ(y*, xj* | xS*)| > ϵ} ≤ 3^{|S|}C1n{(1 − C2ϵ²/n^{4q})^n + exp(−C3n^q)}.

Step 2: If the distribution is assumed to be elliptical, we can use the sample version of the marginal kurtosis, viz.

κ^ = (1/p) Σ_{j=1}^p [{n^{−1} Σ_{i=1}^n (x˜ij − x˜¯j)⁴} / {3{n^{−1} Σ_{i=1}^n (x˜ij − x˜¯j)²}²} − 1],

to estimate the kurtosis. Similar to the proof in Step 1, we can obtain the following inequality:

Pr(|κ^ − κ| > ϵ) ≤ C4 exp(−C3 n^{1−4q} ϵ²) + C2 n exp(−C1 n^q).

Step 3: To study Pr{|Z^n(y*, xj* | xS*)/√(1 + κ^) − Zn(y*, xj* | xS*)/√(1 + κ)| > ϵ}, where the difference inside the absolute value is denoted by Δ below, define g2(u, v) = ln{(1 + u)/(1 − u)}/{2√(1 + v)} for all u ∈ (−1, 1) and v ∈ (−1, ∞). Then

Z^n(y*, xj* | xS*)/√(1 + κ^) = g2{ρ^(y*, xj* | xS*), κ^}  and  Zn(y*, xj* | xS*)/√(1 + κ) = g2{ρ(y*, xj* | xS*), κ}.

All the first and second derivatives are continuous and bounded for u ∈ (−τ, τ),v ∈ (−δ, +∞). By Theorem 1 in [10], under (B3) and (B5),

Pr(|Δ| > ϵ) ≤ Pr{‖(ρ^(y*, xj* | xS*), κ^)^⊤ − (ρ(y*, xj* | xS*), κ)^⊤‖2 > Cϵ}
≤ Pr{|ρ^(y*, xj* | xS*) − ρ(y*, xj* | xS*)| > Cϵ/2} + Pr(|κ^ − κ| > Cϵ/2)
≤ (3^{|S|} + 1)C1n{(1 − C2ϵ²/n^{4q})^n + exp(−C3n^q)}.

Step 4: Next we bound Pr(Ej|S). When testing the jth predictor given S ⊆ {j}^C, denote the event

Ej|S = {an error occurs when testing ρ(y*, xj* | xS*) = 0} = E^I_{j|S} ∪ E^II_{j|S},

where E^I_{j|S} denotes the Type I error event and E^II_{j|S} the Type II error event. We have

E^I_{j|S} = [(n − |S| − 1)^{1/2}|Z^n(y*, xj* | xS*)/√(1 + κ^)| > Φ^{−1}(1 − α/2) when Zn(y*, xj* | xS*) = 0].

Then

Pr(E^I_{j|S}) = Pr[(n − |S| − 1)^{1/2}|Z^n(y*, xj* | xS*)/√(1 + κ^)| > Φ^{−1}(1 − α/2) when Zn(y*, xj* | xS*) = 0]
≤ Pr{(n − |S| − 1)^{1/2}|Δ| > Φ^{−1}(1 − α/2)}
= Pr[|Δ| > √{n/(n − |S| − 1)} cn/{2√(1 + κ)}]
≤ Pr[|Δ| > cn/{2√(1 + κ)}]
≤ 3^{|S|}C1n{(1 − C2cn²/n^{4q})^n + exp(−C3n^q)},

where the equality in the third line follows by choosing α = 2[1 − Φ{√n cn/(2√(1 + κ))}]. Furthermore,

E^II_{j|S} = [(n − |S| − 1)^{1/2}|Z^n(y*, xj* | xS*)/√(1 + κ^)| ≤ Φ^{−1}(1 − α/2) when Zn(y*, xj* | xS*) ≠ 0].

With the same choice α = 2[1 − Φ{√n cn/(2√(1 + κ))}], we obtain the following inequality:

Pr(E^II_{j|S}) = Pr[(n − |S| − 1)^{1/2}|Z^n(y*, xj* | xS*)/√(1 + κ^)| ≤ Φ^{−1}(1 − α/2) when Zn(y*, xj* | xS*) ≠ 0]
= Pr[|Z^n(y*, xj* | xS*)/√(1 + κ^)| ≤ √{n/(n − |S| − 1)} cn/{2√(1 + κ)} when Zn(y*, xj* | xS*) ≠ 0]
≤ Pr[|Zn(y*, xj* | xS*)/√(1 + κ)| − |Δ| ≤ √{n/(n − |S| − 1)} cn/{2√(1 + κ)}].

Let g3(u) = (1/2) × ln{(1 + u)/(1 − u)}; then |g3(u)| ≥ |u| for all u ∈ (−1, 1). According to (B3), |ρ(y*, xj* | xS*)| ≥ cn whenever this partial correlation is nonzero, so |Zn(y*, xj* | xS*)/√(1 + κ)| ≥ cn/√(1 + κ). Then

Pr(E^II_{j|S}) ≤ Pr[|Δ| ≥ cn/√(1 + κ) − √{n/(n − |S| − 1)} cn/{2√(1 + κ)}]
= Pr[|Δ| ≥ {cn/√(1 + κ)}{1 − √{n/(n − |S| − 1)}/2}]
≤ Pr[|Δ| ≥ 3cn/{8√(1 + κ)}]
≤ 3^{|S|}C1n{(1 − C2cn²/n^{4q})^n + exp(−C3n^q)},

since n/(n − |S| − 1) ≤ 5/4 for large n. Combining the above results, we have

Pr(Ej|S) ≤ Pr(E^I_{j|S}) + Pr(E^II_{j|S}) ≤ 3^{|S|}C1n{(1 − C2cn²/n^{4q})^n + exp(−C3n^q)}.

Step 5: To bound Pr(A^[m] ≠ A), we consider all j ∈ {1, …, p} and all S ⊆ {j}^C with |S| ≤ |A|. Under the partial faithfulness assumption and (B2), define, for each j ∈ {1, …, p}, Kj = {S ⊆ {j}^C : |S| ≤ |A|}. Now

Pr(A^[m] ≠ A) = Pr{an error occurs for some j and some S}
= Pr(∪_{j∈{1,…,p}, S∈Kj} Ej|S)
≤ Σ_{j∈{1,…,p}, S∈Kj} Pr(Ej|S)
≤ p × p^{|A|} sup_{j∈{1,…,p}, S∈Kj} Pr(Ej|S)
≤ 3^{|A|} p^{|A|+1} C1n{(1 − C2cn²/n^{4q})^n + exp(−C3n^q)}.

The second-to-last inequality holds since there are p possible choices of j and at most p^{|A|} possible choices for S. Furthermore, similarly to Lemma 3 in [1], we can show that Pr(m = |A|) → 1.

Step 6: Under (B2), for p = O{exp(n^a)} ≤ C4 exp(n^a) and |A| = O(n^b) ≤ C5 n^b, we have

Pr(A^[m] ≠ A) ≤ 3^{|A|} p^{|A|+1} C1n{(1 − C2cn²/n^{4q})^n + exp(−C3n^q)}.

Therefore,

ln{Pr(A^[m] ≠ A)} ≤ |A| ln 3 + (|A| + 1) ln(p) + ln(C1n) + ln{(1 − C2cn²/n^{4q})^n + exp(−C3n^q)}
≤ C1(|A| + 1) ln(p) + ln{(1 − C2cn²/n^{4q})^n + exp(−C3n^q)}
≤ C1(C5n^b + 1)(ln C4 + n^a) + ln{(1 − C2cn²/n^{4q})^n + exp(−C3n^q)}.

Note that e^{−x} ≤ 1 − x/2 for x ∈ [0, ln 2]. Since q ∈ (0, 1), C3/n^{1−q} ∈ [0, ln 2] for n large enough, so

exp(−C3n^q) = {exp(−C3/n^{1−q})}^n ≤ {1 − C3/(2n^{1−q})}^n.

Therefore, since cn is of order n^{−d} (so that C2cn²/n^{4q} can be written as C2/n^{2d+4q} after adjusting the constant),

ln{Pr(A^[m] ≠ A)} ≤ C1(C5n^b + 1)(ln C4 + n^a) + ln[{1 − C2/n^{2d+4q}}^n + {1 − C3/(2n^{1−q})}^n].

Since 2d + 5(a + b) < 1, we have (1 − 2d)/5 < (1 − a − b − 2d)/4, so q can be chosen with n^{4q} < n^{1−a−b−2d}. Then

n^{a+b}/[n ln{(1 − C2/n^{2d+4q})}] ≈ −n^{a+b}/{nC2/n^{2d+4q}}{1 + o(1)} ~ −n^{4q}/{C2 n^{1−a−b−2d}} → 0

and

n ln{(1 − C2/n^{2d+4q})} ≈ −nC2/n^{2d+4q}{1 + o(1)} = −C2 n^{1−2d−4q}{1 + o(1)} → −∞.

Thus,

ln{Pr(A^[m] ≠ A)} ≤ C1n^{a+b} + n ln(1 − C2/n^{2d+4q}) = n ln{(1 − C2/n^{2d+4q})}[C1n^{a+b}/{n ln(1 − C2/n^{2d+4q})} + 1] → −∞,

since the factor in brackets tends to 1 while n ln(1 − C2/n^{2d+4q}) → −∞. As a result, Pr(A^[m] ≠ A) → 0. This completes the proof of Theorem 1. □

5.3. Proof of Theorem 2

Since we apply least squares on the estimated active set, ‖β^ − β‖ = Op(n^{−1/2}) when A^[m] = A. According to Theorem 1, Pr(A^[m] = A) → 1; thus ‖β^ − β‖ = Op(n^{−1/2}).

We next focus on the asymptotic normality of the nonparametric estimation of g(u). We complete the proof in three steps.

Step 1: First we derive the bias of g^(u). After obtaining the estimated active set, for any fixed u, we have

b^(u) ≡ (b^0(u), b^1(u))^⊤ = argmin_{b0, b1} Σ_{i=1}^n {yi − xi^⊤β^ − b0 − b1(ui − u)}² Kh(ui − u) = (Zu^⊤WuZu)^{−1}Zu^⊤Wu(y − Xβ^),

where

Zu = ( 1  u1 − u
       ⋮   ⋮
       1  un − u )  and  Wu = diag{Kh(u1 − u), …, Kh(un − u)}.

We note the following facts:

  1. E(y − Xβ | X, u) = {g(u1), …, g(un)}^⊤.

  2. (Zu^⊤WuZu)^{−1}Zu^⊤WuX = Op(1), and β^ − β = Op(n^{−1/2}).

  3. g(ui) = b0 + b1(ui − u) + b2(ui − u)² + ···, where b0 = g(u), b1 = g′(u) and b2 = g′′(u)/2.

Now

E(b^|X, u) = (Zu^⊤WuZu)^{−1}Zu^⊤Wu E(y − Xβ^|X, u)
= (Zu^⊤WuZu)^{−1}Zu^⊤Wu E{y − Xβ + X(β − β^)|X, u}
= (Zu^⊤WuZu)^{−1}Zu^⊤Wu [{g(u1), …, g(un)}^⊤ + XE(β − β^|X, u)]
= (Zu^⊤WuZu)^{−1}Zu^⊤Wu {b0 + b1(ui − u) + b2(ui − u)² + ···}_{i=1}^n + Op(n^{−1/2})
= (b0, b1)^⊤ + (Zu^⊤WuZu)^{−1}Zu^⊤Wu {b2(ui − u)² + ···}_{i=1}^n + Op(n^{−1/2})
= (b0, b1)^⊤ + J1 + Op(n^{−1/2}).

Set

Sn = Zu^⊤WuZu = ( Σ_{i=1}^n Kh(ui − u)        Σ_{i=1}^n (ui − u)Kh(ui − u)
                  Σ_{i=1}^n (ui − u)Kh(ui − u)  Σ_{i=1}^n (ui − u)²Kh(ui − u) ),  H = ( 1  0
                                                                                        0  h ).

Then

(1/n)H^{−1}SnH^{−1} = ( (1/n)Σ Kh(ui − u)           {1/(nh)}Σ(ui − u)Kh(ui − u)
                        {1/(nh)}Σ(ui − u)Kh(ui − u)  {1/(nh²)}Σ(ui − u)²Kh(ui − u) ) = f(u)(1 0; 0 μ2) + Op{h + 1/(nh)}.

Next,

Zu^⊤Wu{(u1 − u)^l, …, (un − u)^l}^⊤ = ( nh^l (1/n)Σ{(ui − u)/h}^l Kh(ui − u) ; nh^{l+1} (1/n)Σ{(ui − u)/h}^{l+1} Kh(ui − u) ).

Then J1 can be expressed as follows:

J1 = (Zu^⊤WuZu)^{−1}Zu^⊤Wu{b2(u1 − u)², …, b2(un − u)²}^⊤ + o(h²)
= b2 Sn^{−1} ( nh² (1/n)Σ{(ui − u)/h}²Kh(ui − u) ; nh³ (1/n)Σ{(ui − u)/h}³Kh(ui − u) ) + o(h²)
= b2h² H^{−1}{n^{−1}H^{−1}SnH^{−1}}^{−1} ( (1/n)Σ{(ui − u)/h}²Kh(ui − u) ; (1/n)Σ{(ui − u)/h}³Kh(ui − u) ) + o(h²)
= b2h² H^{−1}[f^{−1}(u)(1 0; 0 1/μ2) + Op{h + 1/(nh)}][f(u)(μ2 ; μ3) + Op{h + 1/(nh)}] + o(h²)
= b2h² H^{−1}[(μ2 ; μ3/μ2) + Op{h + 1/(nh)}] + o(h²).

Thus,

E{g^(u)|X, u} = (1, 0)E(b^|X, u) = b0 + b2h²[μ2 + Op{h + 1/(nh)}] + o(h²) + Op(n^{−1/2}) = b0 + b2μ2h² + o(h²) + Op(n^{−1/2}),

and the last equality holds because nh³ → ∞ as n → ∞. Note that the leading terms on the right-hand side do not depend on X.

Therefore,

E{g^(u)} = b0 + b2μ2h² + o(h²) + Op(n^{−1/2}) = g(u) + g′′(u)μ2h²/2 + o(h²) + Op(n^{−1/2}).

Step 2: Derive the variance of g^(u). Note that var(b^|X, u) = (Zu^⊤WuZu)^{−1}Zu^⊤Wu var(y − Xβ^|X, u)WuZu(Zu^⊤WuZu)^{−1}, and

var(y − Xβ^|X, u) = var{y − Xβ + X(β − β^)|X, u} = var(y − Xβ|X, u) + var{X(β − β^)|X, u} + 2cov{y − Xβ, X(β − β^)|X, u} = {σ² + Op(n^{−1/2})}In.

Then

var(b^|X, u) = (Zu^⊤WuZu)^{−1}Zu^⊤Wu{σ²In + Op(n^{−1/2})}WuZu(Zu^⊤WuZu)^{−1}
= {σ² + Op(n^{−1/2})}Sn^{−1}(Zu^⊤Wu²Zu)Sn^{−1}
= {σ² + Op(n^{−1/2})}H^{−1}(H^{−1}SnH^{−1})^{−1}{H^{−1}(Zu^⊤Wu²Zu)H^{−1}}(H^{−1}SnH^{−1})^{−1}H^{−1}
≡ {σ² + Op(n^{−1/2})}J2.

Furthermore,

H^{−1}(Zu^⊤Wu²Zu)H^{−1} = ( Σ Kh²(ui − u)              Σ{(ui − u)/h}Kh²(ui − u)
                            Σ{(ui − u)/h}Kh²(ui − u)   Σ{(ui − u)/h}²Kh²(ui − u) ) = (n/h)[f(u)(ν0 ν1; ν1 ν2) + Op{h + 1/(nh)}]

and

J2 = (1/n)H^{−1}(n^{−1}H^{−1}SnH^{−1})^{−1}{n^{−1}H^{−1}(Zu^⊤Wu²Zu)H^{−1}}(n^{−1}H^{−1}SnH^{−1})^{−1}H^{−1}
= (1/n)H^{−1}[f^{−1}(u)(1 0; 0 1/μ2) + Op{h + 1/(nh)}] h^{−1}[f(u)(ν0 ν1; ν1 ν2) + Op{h + 1/(nh)}] [f^{−1}(u)(1 0; 0 1/μ2) + Op{h + 1/(nh)}]H^{−1}
= {1/(nh)}H^{−1}[f^{−1}(u)(ν0 ν1/μ2; ν1/μ2 ν2/μ2²) + Op{h + 1/(nh)}]H^{−1}.

Therefore,

var{g^(u)|X, u} = (1, 0)var(b^|X, u)(1, 0)^⊤ = [{σ² + Op(n^{−1/2})}/(nh)][ν0/f(u) + Op{h + 1/(nh)}] = [σ²ν0 + Op(n^{−1/2}) + Op{h + 1/(nh)}]/{nhf(u)}.

Notice that the leading terms on the right-hand side do not depend on X. Therefore,

var{g^(u)} = [σ²ν0 + Op(n^{−1/2}) + Op{h + 1/(nh)}]/{nhf(u)}.

Step 3: In order to derive the asymptotic distribution of g^(u), we can use the following facts:

  1. g^(u) = (1, 0)b^ = (1, 0)(Zu^⊤WuZu)^{−1}Zu^⊤Wu(y − Xβ^).

  2. E{g^(u)|X, u} = (1, 0)E(b^|X, u) = g(u) + g′′(u)μ2h²/2 + Op(n^{−1/2}) + o(h²).

  3. E{g^(u)|X, u} = (1, 0)(Zu^⊤WuZu)^{−1}Zu^⊤Wu E(y − Xβ^|X, u).

  4. y − Xβ − E(y − Xβ|X, u) = (ϵ1, …, ϵn)^⊤.

We first study g^(u) − g(u) − g′′(u)μ2h²/2:

g^(u) − g(u) − g′′(u)μ2h²/2 = g^(u) − E{g^(u)|X, u} + Op(n^{−1/2}) + o(h²)
= (1, 0)(Zu^⊤WuZu)^{−1}Zu^⊤Wu{y − Xβ^ − E(y − Xβ^|X, u)} + Op(n^{−1/2}) + o(h²)
= (1, 0)Sn^{−1}Zu^⊤Wu[y − Xβ − E(y − Xβ|X, u) + X(β − β^) − E{X(β − β^)|X, u}] + Op(n^{−1/2}) + o(h²).

Because Sn^{−1}Zu^⊤WuX = Op(1) and β^ − β = Op(n^{−1/2}), we obtain

g^(u) − g(u) − g′′(u)μ2h²/2 = (1, 0)Sn^{−1}Zu^⊤Wu{y − Xβ − E(y − Xβ|X, u)} + Op(n^{−1/2}) + o(h²)
= (1, 0)Sn^{−1}Zu^⊤Wu(ϵ1, …, ϵn)^⊤ + Op(n^{−1/2}) + o(h²)
= (1, 0)H^{−1}(nHSn^{−1}H){n^{−1}H^{−1}Zu^⊤Wu(ϵ1, …, ϵn)^⊤} + Op(n^{−1/2}) + o(h²).

Furthermore, given that

n^{−1}H^{−1}SnH^{−1} →P f(u)(1 0; 0 μ2),

we have

nHSn^{−1}H →P {1/f(u)}(1 0; 0 1/μ2).

Moreover,

H^{−1}{n^{−1}Zu^⊤Wu(ϵ1, …, ϵn)^⊤} = (1 0; 0 1/h)( (1/n)Σ Kh(ui − u)ϵi ; (1/n)Σ(ui − u)Kh(ui − u)ϵi ) = ( (1/n)Σ Kh(ui − u)ϵi ; (1/n)Σ{(ui − u)/h}Kh(ui − u)ϵi ).

We now turn to (1/n)Σ_{i=1}^n Kh(ui − u)ϵi and find

ξn² = var{(1/n)Σ_{i=1}^n Kh(ui − u)ϵi} = (1/n)E{Kh²(ui − u)ϵi²} = (σ²/n)E{Kh²(ui − u)}.

Note that

E{Kh²(ui − u)} = ∫{h^{−1}K((ui − u)/h)}² f(ui) dui = h^{−1}∫K²(t)f(u + th) dt = h^{−1}∫K²(t){f(u) + thf′(u) + o(h)} dt = h^{−1}{f(u)ν0 + hf′(u)ν1 + o(h)}.

Thus ξn² = σ²{f(u)ν0 + hf′(u)ν1 + o(h)}/(nh) ≡ {C1 + o(h)}/(nh). Similarly, we have

Σ_{i=1}^n E|Kh(ui − u)ϵi/n|³ = n^{−2}E{Kh³(ui − u)} × E|ϵi|³ = n^{−2}h^{−2}{C2 + o(h)}.

Therefore, as n → ∞ with nh³ → ∞ (so that nh → ∞),

(1/ξn³)Σ_{i=1}^n E|Kh(ui − u)ϵi/n|³ = [{C2 + o(h)}/(n²h²)] / [{C1 + o(h)}/(nh)]^{3/2} = {C3 + o(h)}/√(nh) → 0.

By the Lyapunov Central Limit Theorem, we get

[(1/n)Σ_{i=1}^n Kh(ui − u)ϵi] / √[σ²{f(u)ν0 + hf′(u)ν1 + o(h)}/(nh)] ⇝ N(0, 1).

That is,

√(h/n) Σ_{i=1}^n Kh(ui − u)ϵi ⇝ N[0, σ²f(u)ν0].

Similarly,

√(h/n) Σ_{i=1}^n {(ui − u)/h}Kh(ui − u)ϵi ⇝ N[0, σ²f(u)ν2].

Applying Slutsky’s Lemma, we find

√(nh){g^(u) − g(u) − g′′(u)μ2h²/2}
= √(nh)(1, 0)H^{−1}(nHSn^{−1}H){n^{−1}H^{−1}Zu^⊤Wu(ϵ1, …, ϵn)^⊤} + Op(√h) + o(√(nh⁵))
= (1, 0)(nHSn^{−1}H)( √(h/n)Σ Kh(ui − u)ϵi ; √(h/n)Σ{(ui − u)/h}Kh(ui − u)ϵi ) + Op(√h) + o(√(nh⁵))
⇝ {1/f(u)} N[0, σ²f(u)ν0] = N[0, σ²ν0/f(u)],

which completes the proof of Theorem 2. □

6. Conclusion

In this paper, we proposed a new approach to selecting significant variables in partially linear models via partial correlation learning. Under the partial faithfulness framework, nonparametric smoothing techniques are adopted to obtain the partial residuals, and recursive hypothesis tests of the partial correlations between partial residuals are conducted to select the linear covariates in a backward direction. Model selection consistency is proved and empirically verified through simulations. Furthermore, the √n-consistency of the estimated linear coefficients and the asymptotic normality of the nonparametric baseline estimator are established. The performance of the method is further illustrated by the supermarket data analysis.


Acknowledgments

The authors are grateful to the Editor-in-Chief, the Associate Editor and the referees for comments and suggestions that led to significant improvements. Liu’s research was supported by National Natural Science Foundation of China (NNSFC) grants 11771361 and 11671334, JAS14007 and the Fundamental Research Funds for the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry. Li’s research was supported by National Institute on Drug Abuse (NIDA) grants P50 DA039838 and P50 DA036107, and National Science Foundation grants DMS 1512422 and DMS 1820702. This work was also partially supported by NNSFC grant 11690015. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NSF, the NIDA, the NIH, or the NNSFC.

Footnotes

Online Supplement

The Online Supplement includes two parts. The first half provides the proof of Lemma 2 and the second half reports the additional simulation results mentioned in Section 4.

References

  • [1] Bühlmann P., Kalisch M., Maathuis M., Variable selection in high-dimensional linear models: Partially faithful distributions and the PC-simple algorithm, Biometrika 97 (2010) 261–278.
  • [2] Chen J., Chen Z., Extended Bayesian information criteria for model selection with large model spaces, Biometrika 95 (2008) 759–771.
  • [3] Fan J., Gijbels I., Local Polynomial Modelling and Its Applications, Chapman & Hall, London, 1996.
  • [4] Fan J., Huang T., Profile likelihood inferences on semiparametric varying-coefficient partially linear models, Bernoulli 11 (2005) 1031–1057.
  • [5] Fan J., Li R., Variable selection via nonconcave penalized likelihood and its oracle properties, J. Amer. Statist. Assoc. 96 (2001) 1348–1360.
  • [6] Fan J., Li R., New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis, J. Amer. Statist. Assoc. 99 (2004) 710–723.
  • [7] Fan J., Lv J., Sure independence screening for ultrahigh dimensional feature space (with discussion), J. R. Stat. Soc. Ser. B 70 (2008) 849–911.
  • [8] Heckman N., Spline smoothing in a partly linear model, J. R. Stat. Soc. Ser. B 48 (1986) 244–248.
  • [9] Li R., Liang H., Variable selection in semiparametric regression modeling, Ann. Statist. 36 (2008) 261–286.
  • [10] Li R., Liu J., Lou L., Variable selection via partial correlation, Statist. Sinica 27 (2017) 983–996.
  • [11] Liang H., Li R., Variable selection for partially linear models with measurement errors, J. Amer. Statist. Assoc. 104 (2009) 234–248.
  • [12] Liu J., Li R., Wu R., Feature selection for varying coefficient models with ultrahigh dimensional covariates, J. Amer. Statist. Assoc. 109 (2014) 266–274.
  • [13] Ruppert D., Sheather S.J., Wand M.P., An effective bandwidth selector for local least squares regression, J. Amer. Statist. Assoc. 90 (1995) 1257–1270.
  • [14] Ruppert D., Wand M.P., Carroll R.J., Semiparametric Regression, Cambridge University Press, New York, 2003.
  • [15] Speckman P., Kernel smoothing in partial linear models, J. R. Stat. Soc. Ser. B 50 (1988) 413–436.
  • [16] Tibshirani R.J., Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B 58 (1996) 267–288.
  • [17] Wang H., Xia Y., Shrinkage estimation of the varying coefficient model, J. Amer. Statist. Assoc. 104 (2009) 747–757.
  • [18] Xie H., Huang J., SCAD-penalized regression in high-dimensional partially linear models, Ann. Statist. 37 (2009) 673–696.
