Published in final edited form as: J Multivar Anal. 2018 Jun 20;167:18–434. doi: 10.1016/j.jmva.2018.06.005

Variable selection for partially linear models via partial correlation

Jingyuan Liu a, Lejia Lou b, Runze Li c,*

Abstract

The partially linear model (PLM) is a useful semiparametric extension of the linear model that has been well studied in the statistical literature. This paper proposes a variable selection procedure for the PLM with ultrahigh dimensional predictors. The proposed method is different from the existing penalized least squares procedure in that it relies on partial correlation between the partial residuals of the response and the predictors. We systematically study the theoretical properties of the proposed procedure and prove its model consistency property. We further establish the root-n convergence of the estimator of the regression coefficients and the asymptotic normality of the estimate of the baseline function. We conduct Monte Carlo simulations to examine the finite-sample performance of the proposed procedure and illustrate the proposed method with a real data example.

Keywords: Model selection consistency, partial faithfulness, semiparametric regression modeling

1. Introduction

Let y be a response variable, u be a univariate continuous covariate and x = (x1, …, xp) be a p-dimensional covariate vector. The partially linear model (PLM) assumes that

y = g(u) + x^⊤β + ϵ, (1)

where g is an unspecified baseline function, and β is a vector of unknown regression coefficients. The PLM thus assumes that the regression function linearly depends on the covariates x while depending nonparametrically on u. This model increases the flexibility of linear models by allowing the intercept to be a nonparametric function of the covariate u. It is one of the most popular semiparametric regression models in the literature [14].
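To fix ideas, the following sketch generates data from model (1); the baseline function, coefficients, and noise level are illustrative assumptions of ours, not values taken from the paper.

```python
# A minimal sketch of data generated from the PLM (1); g, beta, and the
# noise scale below are illustrative assumptions, not the paper's choices.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 10
u = rng.uniform(size=n)                       # univariate covariate
x = rng.normal(size=(n, p))                   # linear covariates
beta = np.zeros(p)
beta[[0, 1, 4]] = [3.0, 1.5, 2.0]             # sparse coefficient vector
g = lambda t: np.sin(2 * np.pi * t)           # unspecified baseline function
y = g(u) + x @ beta + rng.normal(scale=0.5, size=n)
```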

This work aims to develop a variable selection procedure for the PLM with ultrahigh dimensional x, i.e., p = O{exp(n^a)} for some positive constant a, where n is the sample size. PLM estimation has been well studied in the case where p is finite and fixed; see, e.g., [8], [15]. Variable selection procedures have also been developed in this case, e.g., by Fan and Li [6] via penalized least squares, and by Liang and Li [11], who employed the penalized least squares method for variable selection in the PLM in the presence of errors in variables. Xie and Huang [18] studied the penalized least squares method with the SCAD penalty [5] for variable selection in the PLM with p = o(n).

In this paper, we propose a new variable selection procedure for the PLM. This procedure differs from the aforementioned penalized least squares methods in that it is a partial correlation learning procedure based on the notion of partial faithfulness, which was first advocated by Bühlmann et al. [1] for normal linear models and further used for elliptical linear models in [10]. We first utilize partial residual techniques to eliminate the nonparametric baseline function, and then conduct variable selection by recursively testing the partial correlations between the partial residual of the response and those of the linear covariates. That is, we recursively compare the partial correlations with some threshold values, and therefore refer to this method as the thresholded partial correlation on partial residuals (TPC-PR). Thus, the TPC-PR can be carried out by using the algorithm proposed in [10].

This paper’s main purpose is to study the theoretical properties of the TPC-PR and to ensure that partial correlation learning works properly for the PLM. Developing the asymptotic theory of the TPC-PR is challenging since we have to deal with the approximation errors due to the nonparametric estimation involved in the partial residuals. Furthermore, we need to study partial faithfulness under the PLM setting without assuming normality. We first establish a concentration inequality for the partial correlations of the partial residuals. We then prove the model selection consistency of the TPC-PR under the PLM with ultrahigh dimensional x. We further establish the √n-consistency of the regression coefficient estimate and the asymptotic normality of the nonparametric baseline estimate.

The rest of the paper is organized as follows. In Section 2, we discuss partial faithfulness under the PLM setting, and systematically study the asymptotic theory of partial correlations between partial residuals. We propose the TPC-PR in Section 3, and carefully study its theoretical properties. Section 4 provides the results of Monte Carlo studies and a real data example. Technical proofs are given in Section 5, along with the corresponding regularity conditions to facilitate the proofs. A conclusion is provided in Section 6.

2. Partial faithfulness and partial correlations for the PLM

In model (1), assume that the random error satisfies E(ϵ|u) = E(ϵ|xj) = 0 for all j ∈ {1, …, p} and E(ϵ²) < ∞. The objective is to recover the truly active model A = {j ∈ {1, …, p} : βj ≠ 0} with cardinality |A|, as well as to estimate g(u) and the nonzero coefficients in β. As most variable selection procedures do, we impose here a sparsity assumption, namely that |A| = O(n^b), where b is defined in Theorems 1 and 2.

2.1. Partial faithfulness in partially linear models

We first discuss a partial faithfulness assumption for the PLM in (1) as a theoretical basis for our proposed variable selection procedure. This assumption, initially formulated by Bühlmann et al. [1], states that if the partial correlation between a given predictor and the response is zero given some subset of the other predictors, then the correlation between this predictor and the response is also zero given all other predictors. However, when the nonparametric baseline function is taken into consideration in model (1), this assumption is not directly applicable. Thus we first need to apply a partial residual technique to deal with the nonparametric part. Specifically, note that E(y|u) = g(u) + E(x|u)^⊤β + E(ϵ|u).

Model (1) can be written in the form

y* = x*^⊤β + ϵ*, (2)

where y* = y − E(y|u) and x* = (x1*, …, xp*)^⊤ = x − E(x|u) are called the partial residuals of y and x on u, and ϵ* = ϵ − E(ϵ|u). It is easy to show that when the covariance matrix of x*, denoted by Σx*, is positive definite, we have

β = Σx*^{−1} cov(x*, y*). (3)

Therefore, for any j ∈ {1, …, p},

βj = 0 ⇔ ρ(y*, xj* | xk*, k ∈ {j}^C) = 0, (4)

where ρ(z1, z2|z3) represents the partial correlation between z1 and z2 after adjusting both for z3, and {j}^C = {1, …, j − 1, j + 1, …, p}. This provides the rationale for recovering the nonzero coefficients in A by evaluating partial correlations.
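As a quick numeric check of (3), the sketch below assumes the conditional means E(x|u) and E(y|u) are known exactly — an assumption made only for illustration; in practice they must be estimated, as in Section 2.2 — and recovers β from the partial residuals.

```python
# Numeric check of identity (3): beta = Sigma_{x*}^{-1} cov(x*, y*).
# E(x|u) is taken to be a known function here, purely for illustration.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100_000, 5
u = rng.uniform(size=n)
m = np.cos(2 * np.pi * np.outer(u, np.arange(1, p + 1)))  # E(x|u), assumed known
x = m + rng.normal(size=(n, p))
beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0])
y = u**2 + x @ beta + rng.normal(size=n)                  # g(u) = u^2

x_star = x - m                        # partial residuals of x on u
y_star = y - (u**2 + m @ beta)        # E(y|u) = g(u) + E(x|u)' beta
Sigma = np.cov(x_star, rowvar=False)
c = np.array([np.cov(x_star[:, j], y_star)[0, 1] for j in range(p)])
print(np.linalg.solve(Sigma, c))      # approximately (3, 1.5, 0, 0, 2)
```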

However, the computation of ρ(y*, xj* | xk*, k ∈ {j}^C) is infeasible under the high-dimensional setting when p is large. To address this issue, we adapt the concept of partial faithfulness [1] to the PLM; it allows us to replace the evaluation of ρ(y*, xj* | xk*, k ∈ {j}^C) by recursive assessments of partial correlations with lower-dimensional conditioning sets, in a backward direction. The partial faithfulness of the PLM is defined as follows.

Definition 1. The partially linear model (1) is said to be partially faithful if for every j ∈ {1, …, p}, ρ(y*,xj*|xS*)=0 for some S{j}C implies that ρ(y*,xj*|xk*,k{j}C)=0, where xj* and y* are defined in model (2), and xS*={xj*:jS} for some index set S.

To fully understand the implications of the partial faithfulness, recall that (4) indicates that for every j ∈ {1, …, p} in model (2), βj = 0 is equivalent to {ρ(y*,xj*|xk*,k{j}C)=0}. Therefore,

βj = 0 ⇔ ρ(y*, xj* | xS*) = 0 for some S ⊆ {j}^C,

or equivalently,

βj ≠ 0 ⇔ ρ(y*, xj* | xS*) ≠ 0 for all S ⊆ {j}^C.

That is, under the partial faithfulness assumption, the predictor xj may be removed if there exists one subset xS* such that xj* is no longer needed once xS* is in the model. In particular, taking S = ∅ shows that an active predictor must also have a non-trivial marginal correlation with the response. Thus partial faithfulness rules out the situation where some predictors are marginally uncorrelated with the response but possess joint effects with other covariates. This coincides with the assumption underlying any sure screening procedure, initiated by [7].

Lemma 1 below provides sufficient conditions for partial faithfulness of PLM.

Lemma 1. Assume that

(A1) Σx* is positive definite for all u;

(A2) {βj : j ∈ A} ~ f(b)db, where f denotes the density, on a subset of ℝ^{|A|}, of a distribution that is absolutely continuous with respect to Lebesgue measure.

Then (x*,y*) satisfies partial faithfulness almost surely with respect to the distribution generating non-zero regression coefficients.

Conditions (A1) and (A2) are inspired by [1]. Condition (A1) guarantees the identifiability of β due to (3). Condition (A2) may be interpreted from a Bayesian point of view. We can treat nonzero βjs as independently and identically distributed random variables from a population with a non-trivial density. This condition is mild in the sense that from a Bayesian perspective, the zero coefficients can arise in an arbitrary fashion. We remark here that though already mild, (A1) and (A2) may not be the weakest conditions to guarantee partial faithfulness.

Based on Lemma 1, in order to identify the nonzero βj's, it suffices to test recursively the above partial correlations with index sets S of sequentially increasing cardinality |S|. Lemma 1 is a direct corollary of Theorem 1 in [1].

2.2. Asymptotics of sample partial correlations for PLM

The problem of comparing ρ(y*, xj* | xS*) with 0 becomes, in practice, testing the null hypothesis H0: ρ(y*, xj* | xS*) = 0. This requires studying the asymptotic behavior of the estimated partial correlations ρ^(y*, xj* | xS*), which are computed through several estimated conditional means. We first apply local linear regression [3] to estimate E(y|u) and E(x|u) in y* and x* based on the random sample (u1, x1, y1), …, (un, xn, yn). The smoothing matrix S(h) is computed as

S(h) = ( (1, 0){Z(u1)^⊤W(u1,h)Z(u1)}^{−1}Z(u1)^⊤W(u1,h)
         ⋮
         (1, 0){Z(un)^⊤W(un,h)Z(un)}^{−1}Z(un)^⊤W(un,h) ),

where

Z(u) = ( 1  u1 − u
         ⋮   ⋮
         1  un − u )  and  W(u,h) = diag{Kh(u1 − u), …, Kh(un − u)},

with Kh(·) = K(·/h)/h, where K is a kernel function and h a bandwidth. Then sample versions of the partial residuals y* and x* can be obtained as

y^* = {In − S(hy)}y,  X^* = [{In − S(h1)}X1, …, {In − S(hp)}Xp], (5)

where (X1, …, Xp) = X = (x1, …, xn)^⊤ and y = (y1, …, yn)^⊤, In is the n × n identity matrix, and hy and h1, …, hp are the bandwidths for estimating E(y|u) and E(x1|u), …, E(xp|u), respectively. The marginal correlations between the partial residuals y* and x1*, …, xp* are then estimated by the Pearson correlation between y^* and each column of X^*, viz.

ρ^(y*, xj*) = ⟨{In − S(hy)}y, {In − S(hj)}Xj⟩ / [‖{In − S(hy)}y‖ ‖{In − S(hj)}Xj‖].
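A hedged sketch of this construction follows: it builds the local linear smoothing matrix S(h) with a Gaussian kernel and forms the sample partial residuals in (5). For simplicity the sketch applies a single bandwidth to all columns, whereas the paper selects hy, h1, …, hp separately; the bandwidth value in the usage comment is an ad hoc assumption rather than the plug-in choice used later.

```python
# A sketch of the local linear smoother S(h) and the partial residuals (5).
# The Gaussian kernel and the single shared bandwidth are simplifying
# assumptions for illustration.
import numpy as np

def local_linear_smoother(u, h):
    """Row i of S(h) holds the local linear fitting weights at u_i."""
    n = len(u)
    S = np.empty((n, n))
    for i in range(n):
        z = np.column_stack([np.ones(n), u - u[i]])       # Z(u_i)
        w = np.exp(-0.5 * ((u - u[i]) / h) ** 2) / h      # K_h(u_j - u_i)
        zw = z.T * w                                      # Z(u_i)' W(u_i, h)
        S[i] = np.linalg.solve(zw @ z, zw)[0]             # (1, 0) {Z'WZ}^{-1} Z'W
    return S

def partial_residuals(u, y, X, h):
    """Return {I - S(h)} y and {I - S(h)} X_j for every column j."""
    S = local_linear_smoother(u, h)
    return y - S @ y, X - S @ X

# usage: ys, Xs = partial_residuals(u, y, x, h=0.1); the marginal correlation
# above is ys @ Xs[:, j] / (np.linalg.norm(ys) * np.linalg.norm(Xs[:, j])).
```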

Following [1], the partial correlations can be computed recursively, for any kS, by

ρ^(y*, xj* | xS*) = {ρ^(y*, xj* | xS\{k}*) − ρ^(y*, xk* | xS\{k}*)ρ^(xj*, xk* | xS\{k}*)} / [{1 − ρ^2(y*, xk* | xS\{k}*)}{1 − ρ^2(xj*, xk* | xS\{k}*)}]^{1/2}. (6)
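The recursion (6) can be implemented directly from the correlation matrix of the partial residuals; a sketch follows. It peels off one conditioning index at a time, exactly as in the display, though a regression-based computation would be cheaper for large conditioning sets.

```python
# A direct implementation of recursion (6); variable 0 plays the role of y*
# and variables 1..p the roles of x_1*, ..., x_p* in the correlation matrix R.
import numpy as np

def pcor(R, a, b, S):
    """Partial correlation of variables a and b given the index tuple S."""
    if not S:
        return R[a, b]
    k, rest = S[0], S[1:]
    r_ab = pcor(R, a, b, rest)
    r_ak = pcor(R, a, k, rest)
    r_bk = pcor(R, b, k, rest)
    return (r_ab - r_ak * r_bk) / np.sqrt((1 - r_ak**2) * (1 - r_bk**2))
```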

Next, we discuss the asymptotic normality of the partial correlations ρ^(y*,xj*|xS*) under a partially linear model setting.

Lemma 2. For any j ∈ {1, …, p} and S{j}C, under regularity conditions (B1)–(B8) in Appendix A, we have

√n{ρ^(y*, xj* | xS*) − ρ(y*, xj* | xS*)} ⇝ N[0, (1 + κ){1 − ρ²(y*, xj* | xS*)}²],

where κ is the marginal kurtosis of the distribution of (u, x, y), and ⇝ denotes convergence in distribution.

3. Variable selection via partial correlations of partial residuals

Lemma 1 states that identifying the nonzero coefficients is equivalent to recursively testing H0: ρ(y*, xj* | xS*) = 0 for different S. Lemma 2 further suggests conducting a Z-test based on the asymptotically normal distribution of ρ^(y*, xj* | xS*). We claim βj = 0 and delete xj if at least one S ⊆ {j}^C can be found such that H0 cannot be rejected for this S.

Specifically, we first set S = ∅: by Lemma 1, we can delete xj if H0 is not rejected for ρ(y*, xj*) with the empty conditioning set, and obtain the first-step active set A^[1]. Note that only marginal utilities are involved in this step, hence the procedure for obtaining A^[1] can be viewed as a feature screening technique for the PLM. Among the candidate covariates xj in A^[1], we continue to assess the partial correlations given each individual xk, k ∈ A^[1]. The insignificant xj's are further deleted, and the second-step active set A^[2] ⊆ A^[1] is obtained. Then the partial correlations given two and more covariates in the current active set are evaluated in a sequential fashion. The procedure naturally stops when the cardinality of the conditioning covariate set exceeds that of the current active set. Any model fitting technique for the PLM in the literature can then be applied to estimate the nonzero coefficients of the linear term as well as the nonparametric baseline function. For the sake of simplicity, the least squares estimates β^j are computed for the nonzero coefficients, and the nonparametric function is estimated by g^(u) = S(h)(y − Xβ^) with the plug-in β^. We summarize the whole procedure in Algorithm 1, in which we follow [10] to set

T(α, n, κ, |S|) = [exp{2√(1 + κ)Φ^{−1}(1 − α/2)/√(n − |S| − 1)} − 1] / [exp{2√(1 + κ)Φ^{−1}(1 − α/2)/√(n − |S| − 1)} + 1]

and

κ^ = (1/p) Σ_{j=1}^p [{n^{−1} Σ_{i=1}^n (xij − x¯j)⁴} / {3{n^{−1} Σ_{i=1}^n (xij − x¯j)²}²} − 1],

where Φ^{−1} is the inverse of the cumulative distribution function of N(0, 1), x¯j is the sample mean of the jth element of x, and xij is the jth element of xi.
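The threshold and the kurtosis estimate displayed above translate into code as follows; this is a sketch under the stated formulas, with scipy's normal quantile standing in for Φ^{−1}.

```python
# Sketches of T(alpha, n, kappa, |S|) and kappa_hat from the display above.
import numpy as np
from scipy.stats import norm

def threshold(alpha, n, kappa, s):
    z = 2 * np.sqrt(1 + kappa) * norm.ppf(1 - alpha / 2) / np.sqrt(n - s - 1)
    return (np.exp(z) - 1) / (np.exp(z) + 1)    # equals tanh(z / 2)

def kappa_hat(X):
    Xc = X - X.mean(axis=0)
    m2 = (Xc**2).mean(axis=0)                   # per-column second moments
    m4 = (Xc**4).mean(axis=0)                   # per-column fourth moments
    return np.mean(m4 / (3 * m2**2) - 1)
```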

We remark here that the bandwidths h used in the smoothing matrices S(h) are chosen separately as hy and h1, …, hp for estimating the conditional means of y and of each xj. Several bandwidth selection techniques can be adopted; we use the plug-in method of [13] rather than cross-validation to save computational cost. In addition, the bandwidth needs to be selected again when the nonparametric baseline function is refitted after the active predictors have been detected.

Next we discuss the theoretical properties of the proposed TPC-PR procedure. We first establish its model selection consistency in the following theorem under the ultrahigh dimensional PLM setting.

Algorithm 1: TPC-PR procedure for the PLM

1. Compute the sample partial residuals y^* and X^* by (5).
2. Set m = 1 and construct the first-step estimated active set by evaluating the marginal correlations between partial residuals: A^[1] = {j ∈ {1, …, p} : |ρ^(y*, xj*)| > T(α, n, κ^, 0)}.
3. Let m = m + 1. Establish the mth-step estimated active set as A^[m] = {j ∈ A^[m−1] : |ρ^(y*, xj* | xS*)| > T(α, n, κ^, m − 1) for all S ⊆ A^[m−1]\{j} with |S| = m − 1}.
4. Repeat Step 3 until |A^[m]| ≤ m.
5. The estimated coefficient vector β^ = (β^1, …, β^p)^⊤ is defined as follows: β^j = 0 if j ∉ A^[m]; for j ∈ A^[m], β^j is the least squares estimate obtained by regressing the partial residuals.
6. Obtain the estimated nonparametric baseline function by g^(u) = S(h)(y − Xβ^).
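A compact sketch of Algorithm 1, assembled from the hypothetical helpers sketched earlier (partial_residuals, pcor, threshold, kappa_hat), is given below; it illustrates the control flow and is not the authors' implementation.

```python
# A sketch of Algorithm 1 (TPC-PR). Steps are numbered as in the algorithm.
import itertools
import numpy as np

def tpc_pr(u, y, X, alpha=0.05, h=0.1):
    ys, Xs = partial_residuals(u, y, X, h)                       # Step 1
    R = np.corrcoef(np.column_stack([ys, Xs]), rowvar=False)     # y* is index 0
    n, p = X.shape
    kap = kappa_hat(X)
    active = [j for j in range(p)                                # Step 2
              if abs(R[0, j + 1]) > threshold(alpha, n, kap, 0)]
    m = 1                                                        # here m = |S|
    while len(active) > m:                                       # Steps 3-4
        keep = []
        for j in active:
            others = [k for k in active if k != j]
            if all(abs(pcor(R, 0, j + 1, tuple(k + 1 for k in S))) >
                   threshold(alpha, n, kap, m)
                   for S in itertools.combinations(others, m)):
                keep.append(j)
        active, m = keep, m + 1
    beta = np.zeros(p)                                           # Step 5
    if active:
        beta[active] = np.linalg.lstsq(Xs[:, active], ys, rcond=None)[0]
    return active, beta          # Step 6 then refits g with the plug-in beta
```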

Theorem 1. Assume regularity conditions (B1)–(B8) in Appendix A for the partially linear model (1). Further assume the partial faithfulness from Definition 1. Then there exist a sequence α → 0 and constants a and b with 0 < a + b < (1 − 2d)/5 and d ∈ (0, 1/2) such that, for p = O{exp(n^a)} and |A| = O(n^b), we have Pr(A^[m] ≠ A) → 0 as n → ∞.

Theorem 1 implies that the selected model successfully captures the true one with probability tending to 1. In the proof of this theorem, we in fact show that the probability of the selected model being the true one tends to 1 at an exponential rate. The result is more challenging to establish than that of [10], since the approximation error of the partial residuals has to be taken into consideration. Theorem 2 below states the √n-consistency of the estimated linear coefficients, as well as the asymptotic normality of the estimated nonparametric baseline function.

Theorem 2. Under the same conditions as in Theorem 1, for p = O{exp(n^a)} and |A| = O(n^b), where a and b are defined as in Theorem 1, we have ‖β^ − β‖ = Op(n^{−1/2}), where ‖·‖ refers to the L2 norm. Furthermore,

√(nh){g^(u) − g(u) − g′′(u)μ2h²/2} ⇝ N[0, σ²ν0/f(u)],

as n → ∞, where μj = ∫u^j K(u)du, ν0 = ∫K²(u)du, and h is the bandwidth used in the smoothing matrix for estimating the nonparametric function in the last step of the algorithm.

From Theorem 2, we further derive the asymptotic bias and variance of g^(u) at any given value of u, viz.

bias{g^(u)} = g′′(u)μ2h²/2 + o(h²) + Op(n^{−1/2}),  var{g^(u)} = σ²ν0/{nhf(u)}{1 + o(1)}.
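Under these expansions, a rough pointwise confidence interval for g(u) can be formed from the asymptotic variance; the sketch below ignores the O(h²) bias term (as is common when the bandwidth is undersmoothed) and assumes estimates of f(u), σ², and ν0 are available.

```python
# A rough pointwise 95% interval for g(u0) from the asymptotic variance above;
# f_hat_u0, sigma2_hat, and nu0 are assumed to be available estimates.
import numpy as np

def g_interval(g_hat_u0, f_hat_u0, sigma2_hat, nu0, n, h):
    se = np.sqrt(sigma2_hat * nu0 / (n * h * f_hat_u0))  # sqrt of var{g_hat(u)}
    return g_hat_u0 - 1.96 * se, g_hat_u0 + 1.96 * se    # bias term ignored
```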

Theorems 1 and 2 ensure the theoretical validity of using the selected TPC-PR model for subsequent inference.

4. Numerical studies

In this section, we conduct simulation studies to assess the finite-sample performance of TPC-PR and to empirically verify the theoretical properties stated in the last section. We then illustrate the proposed methodology on a real data example.

4.1. Simulation studies

We evaluate the performance of TPC-PR by comparing it to penalized regression on the partial residuals, with the SCAD penalty [5] and the LASSO penalty [16], respectively. That is, we transform the PLM into a linear model via the partial residual technique, followed by penalized least squares estimation. The nonparametric baseline is estimated in the same fashion as for TPC-PR. Furthermore, the PC-simple algorithm proposed by [1] is also applied to the partial residuals, and is denoted PC-PR. The distinction between TPC-PR and PC-PR is that TPC-PR takes the kurtosis into consideration, so the normality assumption is not necessary for conducting the algorithm; PC-PR, in contrast, relies heavily on the normality of the error term when sequentially testing the partial correlations.

To further enhance the finite-sample performance of TPC-PR, in practice we may fine-tune the critical value. Specifically, we use cT(α, n, κ^, m) as the threshold, where c is a tuning parameter chosen by minimizing the extended Bayesian information criterion [2],

ln(σ^2)+df×ln(p)×ln(n)/n,

where σ^2 is the estimated error variance of the PLM and df is the number of nonzero estimated coefficients. The modified TPC-PR is denoted by TPC-PR-EBIC.
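A sketch of this tuning step follows; it assumes a hypothetical variant of the earlier tpc_pr sketch that accepts a multiplier c on the threshold, and the grid of c values is an arbitrary choice for illustration.

```python
# EBIC fine tuning of the threshold constant c; tpc_pr(..., c=...) is a
# hypothetical extension of the earlier sketch that multiplies T by c.
import numpy as np

def tune_c(u, y, X, grid=(0.6, 0.8, 1.0, 1.2, 1.4), alpha=0.05, h=0.1):
    n, p = X.shape
    best_c, best_ebic = None, np.inf
    ys, Xs = partial_residuals(u, y, X, h)
    for c in grid:
        active, beta = tpc_pr(u, y, X, alpha=alpha, h=h, c=c)
        sigma2 = np.mean((ys - Xs @ beta) ** 2)   # estimated error variance
        ebic = np.log(sigma2) + len(active) * np.log(p) * np.log(n) / n
        if ebic < best_ebic:
            best_c, best_ebic = c, ebic
    return best_c
```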

We conduct the simulation study under three dimension settings: low dimension (p = 20), medium dimension (p = 200), and high dimension (p = 500), with sample size n = 200. For each setting, we consider two distributions for the covariate vector x: a normal distribution and a mixture normal distribution. The normal samples with autoregressive correlation are generated as follows. We first draw (x1, …, x_{p+1}) from a multivariate normal distribution N(0, Σ), where Σ is the covariance matrix with correlation ρ^{|i−j|} between xi and xj and ρ = 0.5 or 0.8, and the error variance σ² is taken to be 0.25 and 1 to represent strong and weak signal, respectively. We then let u = Φ(x_{p+1}), where Φ is the cumulative distribution function of the standard normal distribution N(0, 1); in this way u is correlated with x and follows a uniform distribution. Following the same routine, we draw random samples from a mixture of two normal distributions, viz. 0.9N(0, Σ) + 0.1N(0, 9Σ). The true coefficient vector is β = (3, 1.5, 0, 0, 2, 0, …, 0)^⊤, so that A = {1, 2, 5}. Finally, we take the nonparametric baseline function to be g(u) = u² or sin(2πu) in two scenarios. The experiment is repeated 1000 times; a hedged sketch of this data-generating scheme is given below, after which we list the criteria adopted to assess the performance of all procedures.
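Note that in the mixture case the construction u = Φ(x_{p+1}) makes u only approximately uniform, a detail this sketch does not attempt to correct.

```python
# A sketch of the simulation design: AR-correlated (possibly mixture) normal
# covariates, u = Phi(x_{p+1}), and error variance sigma2.
import numpy as np
from scipy.stats import norm

def simulate(n=200, p=500, rho=0.5, sigma2=1.0, mixture=True, seed=0,
             g=lambda t: np.sin(2 * np.pi * t)):
    rng = np.random.default_rng(seed)
    idx = np.arange(p + 1)
    L = np.linalg.cholesky(rho ** np.abs(idx[:, None] - idx[None, :]))
    z = rng.normal(size=(n, p + 1)) @ L.T        # N(0, Sigma) rows
    if mixture:                                  # 0.9 N(0, Sigma) + 0.1 N(0, 9 Sigma)
        scale = np.where(rng.uniform(size=n) < 0.1, 3.0, 1.0)
        z *= scale[:, None]
    X, u = z[:, :p], norm.cdf(z[:, p])
    beta = np.zeros(p)
    beta[[0, 1, 4]] = [3.0, 1.5, 2.0]            # A = {1, 2, 5}
    y = g(u) + X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    return u, X, y, beta
```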

  1. For evaluating model selection consistency:
    • False positive number (FP): the number of zero coefficients erroneously detected to be nonzero, i.e., FP = Σ_{j=1}^p 1(β^j ≠ 0, βj = 0).
    • True positive number (TP): the number of nonzero coefficients correctly detected to be nonzero, i.e., TP = Σ_{j=1}^p 1(β^j ≠ 0, βj ≠ 0).
    • Under-fit percentage (Under-fit): the proportion of replications missing at least one of the truly active covariates in the linear part.
    • Correctly-fit percentage (Cor-fit): the proportion of replications identifying exactly the truly active set.
    • Over-fit percentage (Over-fit): the proportion of replications identifying all the truly active covariates but erroneously including at least one inactive covariate.
  2. For evaluating the √n-consistency of the linear coefficients:
    • Model error (ME) due to the linear part: ME = E[{x^⊤(β^ − β)}²] = (β^ − β)^⊤cov(x)(β^ − β).
  3. For evaluating the performance of the estimated nonparametric baseline:
    • Square root of average squared errors (RASE), defined by RASE = [Ng^{−1} Σ_{k=1}^{Ng} {g^(νk) − g(νk)}²]^{1/2}, where ν1, …, ν_{Ng} are the grid points at which the functions are evaluated and Ng is the number of grid points.
The medians of ME and RASE, along with their median absolute deviations (Devi), over the 1000 replications are recorded. For the other criteria, we report the average over the 1000 replications.

We present the simulation results in Table 1 for the high-dimensional case (p = 500) with the mixture normal distribution and correlation ρ = 0.5, and in Table 2 for the corresponding case with ρ = 0.8. The remaining results are provided in the Online Supplement.

Table 1:

Simulation results for mixture of normals when p = 500 and ρ = 0.5.

Method ME(Devi) TP FP Under-fit Cor-fit Over-fit RASE(Devi)
σ² = 0.25, g(u) = u²

SCAD 0.0074 (0.0042) 3.000 1.245 0.000 0.640 0.360 0.0917 (0.0247)
LASSO 0.0446 (0.0120) 3.000 32.170 0.000 0.000 1.000 0.1027 (0.0267)
PC-PR 0.0096 (0.0051) 3.000 0.260 0.000 0.760 0.240 0.0893 (0.0255)
TPC-PR 0.0066 (0.0040) 3.000 0.005 0.000 0.995 0.005 0.0895 (0.0250)
TPC-PR-EBIC 0.0066 (0.0040) 3.000 0.000 0.000 1.000 0.000 0.0895 (0.0251)

σ² = 0.25, g(u) = sin(2πu)

SCAD 0.0072 (0.0040) 3.000 1.215 0.000 0.650 0.350 0.1117 (0.0204)
LASSO 0.0649 (0.0213) 3.000 28.750 0.000 0.005 0.995 0.1541 (0.0322)
PC-PR 0.0142 (0.0100) 3.000 0.220 0.000 0.785 0.215 0.1351 (0.0279)
TPC-PR 0.0117 (0.0076) 2.990 0.015 0.010 0.990 0.000 0.1340 (0.0281)
TPC-PR-EBIC 0.0114 (0.0072) 3.000 0.000 0.000 1.000 0.000 0.1340 (0.0278)

σ² = 1, g(u) = u²

SCAD 0.0342 (0.0182) 3.000 4.965 0.000 0.535 0.465 0.1803 (0.0525)
LASSO 0.1736 (0.0430) 3.000 42.945 0.000 0.000 1.000 0.1913 (0.0488)
PC-PR 0.0605 (0.0274) 3.000 0.935 0.000 0.290 0.710 0.1763 (0.0513)
TPC-PR 0.0248 (0.0134) 3.000 0.040 0.000 0.960 0.040 0.1737 (0.0513)
TPC-PR-EBIC 0.0241 (0.0134) 3.000 0.005 0.000 0.995 0.005 0.1737 (0.0508)

σ² = 1, g(u) = sin(2πu)

SCAD 0.0329 (0.0170) 3.000 4.680 0.000 0.555 0.445 0.2010 (0.0395)
LASSO 0.1748 (0.0445) 3.000 41.800 0.000 0.000 1.000 0.2172 (0.0408)
PC-PR 0.0614 (0.0277) 3.000 0.920 0.000 0.295 0.705 0.1991 (0.0402)
TPC-PR 0.0257 (0.0140) 3.000 0.040 0.000 0.960 0.040 0.1968 (0.0402)
TPC-PR-EBIC 0.0250 (0.0136) 3.000 0.005 0.000 0.995 0.005 0.1947 (0.0419)

Table 2:

Simulation results for mixture of normals when p = 500 and ρ = 0.8.

Method ME(Devi) TP FP Under-fit Cor-fit Over-fit RASE(Devi)
σ² = 0.25, g(u) = u²

SCAD 0.0100 (0.0057) 3.000 1.185 0.000 0.640 0.360 0.1121 (0.0316)
LASSO 0.0658 (0.0204) 3.000 29.150 0.000 0.000 1.000 0.1915 (0.0571)
PC-PR 0.0121 (0.0067) 3.000 0.195 0.000 0.815 0.185 0.1089 (0.0345)
TPC-PR 0.0107 (0.0064) 2.995 0.015 0.005 0.985 0.010 0.1089 (0.0332)
TPC-PR-EBIC 0.0101 (0.0060) 3.000 0.000 0.000 1.000 0.000 0.1084 (0.0328)

σ² = 0.25, g(u) = sin(2πu)

SCAD 0.0102 (0.0060) 3.000 1.120 0.000 0.660 0.340 0.1272 (0.0293)
LASSO 0.0902 (0.0300) 3.000 32.095 0.000 0.000 1.000 0.2326 (0.0555)
PC-PR 0.0150 (0.0099) 2.975 0.215 0.025 0.800 0.175 0.1527 (0.0313)
TPC-PR 0.0122 (0.0084) 2.950 0.065 0.050 0.935 0.015 0.1529 (0.0318)
TPC-PR-EBIC 0.0117 (0.0078) 2.995 0.020 0.005 0.980 0.015 0.1512 (0.0304)

σ² = 1, g(u) = u²

SCAD 0.0459 (0.0232) 3.000 4.160 0.000 0.515 0.485 0.2121 (0.0557)
LASSO 0.1736 (0.0430) 3.000 42.945 0.000 0.000 1.000 0.1913 (0.0488)
PC-PR 0.0605 (0.0274) 3.000 0.935 0.000 0.290 0.710 0.1763 (0.0513)
TPC-PR 0.0248 (0.0134) 3.000 0.040 0.000 0.960 0.040 0.1737 (0.0513)
TPC-PR-EBIC 0.0241 (0.0134) 3.000 0.005 0.000 0.995 0.005 0.1737 (0.0508)

σ² = 1, g(u) = sin(2πu)

SCAD 0.0456 (0.0235) 3.000 4.120 0.000 0.540 0.460 0.2277 (0.0504)
LASSO 0.2464 (0.0676) 3.000 37.635 0.000 0.000 1.000 0.3521 (0.0853)
PC-PR 0.0595 (0.0307) 3.000 0.730 0.000 0.430 0.570 0.2317 (0.0504)
TPC-PR 0.0343 (0.0193) 2.970 0.055 0.030 0.945 0.020 0.2296 (0.0550)
TPC-PR-EBIC 0.0327 (0.0176) 3.000 0.000 0.000 1.000 0.000 0.2290 (0.0544)

From Tables 1 and 2, the methods developed in this paper (TPC-PR and TPC-PR-EBIC) successfully identify the three truly active covariates (TP ≈ 3), with fairly low average numbers of falsely included inactive covariates (FP ≈ 0). Likewise, the correct-fit rates approach 1, illustrating the model selection consistency, and the under-fit and over-fit rates are both close to 0. The model error (ME) and the square root of average squared errors (RASE) are small enough to support the √n-consistency of the coefficient estimates and the validity of the nonparametric baseline estimation.

In terms of comparison, the selection methods based on partial correlations favor sparser models than the penalized regression approaches in general. According to Tables 1 and 2, the last three methods consistently outperform the first two penalized regression approaches, especially LASSO, which, as expected, overfits the data and yields overly large models. Although the correct-fit probability for SCAD (around 0.6) is much larger than that of LASSO, it still lags behind the other three methods. The ME and RASE exhibit the same pattern, whichever functional form of the baseline g(u) is assumed.

Among the three partial-correlation-based methods, PC-PR yields the worst results, especially when the noise-to-signal ratio is high (σ² = 1): PC-PR identifies the true model in only about 29% of the experiments, while TPC-PR and TPC-PR-EBIC both exceed 95%. This is because PC-PR uses the wrong variance estimate for the test statistics when the normality assumption fails, whereas TPC-PR incorporates the kurtosis into the limiting distribution of the partial correlations and hence corrects the variance estimate. Under the normal distribution setting (the corresponding results are provided in the Online Supplement), PC-PR and TPC-PR perform similarly, since the asymptotic distributions of the two test statistics are identical under normality. TPC-PR with EBIC fine tuning indeed enhances the finite-sample performance, although TPC-PR already behaves satisfactorily under most circumstances and is sufficient for practical application. The same pattern is observed for both noise levels and both values of the correlation ρ.

Comparing Tables 1 and 2, we observe that as ρ increases from 0.5 to 0.8, it becomes slightly more challenging to identify the true covariates: with highly correlated covariates, the model tends to be overfitted. In the high-correlation scenario, the improvement from TPC-PR-EBIC is more pronounced. Finally, when the error variance σ² increases from 0.25 to 1, PC-PR deteriorates dramatically, while the other methods remain relatively robust. We have also compared the computing times of these methods. In general, the LASSO uses the least computing time: for a simulated data set with p = 500, TPC-PR takes about 20 seconds, while the LASSO takes 0.1 second and the one-step SCAD takes about 3 seconds. TPC-PR is thus slower than the LASSO and the one-step SCAD due to the recursive nature of Algorithm 1, but it can still be carried out within a reasonable amount of time.

4.2. Supermarket data analysis

In this section, we apply the TPC-PR method to analyze a high dimensional data set from a supermarket. The data set consists of 464 daily records of the number of customers entering the supermarket, together with the sales volumes of 6398 products. We suspect a nonlinear relation between the date and the popularity of the store, so a partially linear model is a plausible choice for fitting the data.

We apply SCAD, LASSO, PC-PR, TPC-PR, and TPC-PR-EBIC as in the simulation studies. The model sizes and the prediction errors are reported in Table 3. SCAD and LASSO still yield much larger models, with 28 and 39 selected variables respectively, than the partial-correlation-based methods, while their corresponding prediction errors are higher. Compared with PC-PR and TPC-PR, the TPC-PR-EBIC model is even sparser, with a slight sacrifice in prediction error. The time effect on the number of customers is depicted in Figure 1; some periodic pattern is observed.

Table 3:

Model sizes and prediction errors for the market data analysis.

Approach Model Size Prediction Error
SCAD 28 16.87
LASSO 39 17.53
PC-PR 10 16.36
TPC-PR 10 16.36
TPC-PR-EBIC 7 16.97

Figure 1: The estimated curve of the number of customers against dates.

5. Conditions and proofs

5.1. Regularity conditions and lemmas

The following regularity conditions are imposed to facilitate the proofs.

(B1) For j ∈ {1, …, p}, the conditional expectations E(y|u), E(xj|u), E(yxj|u), E(y2|u), and E(xj2|u) are all uniformly bounded in U, where U is the bounded support of u. Furthermore, we assume there exists δ1 > 0 such that (i) E{var(y|u)} ≥ δ1 and (ii) E{var(xj|u)} ≥ δ1.

(B2) xj and y satisfy the sub-exponential tail probability uniformly in u. That is, there exists s0 > 0 such that for s ∈ (0, s0),

sup_{u∈U} max_{j∈{1,…,p}} E{exp(s xj²)|u} < ∞,  sup_{u∈U} max_{j∈{1,…,p}} E{exp(s xj y)|u} < ∞,  sup_{u∈U} E{exp(s y²)|u} < ∞.

(B3) The partial correlations ρ(y*,xj*|xS*) satisfy

inf{|ρ(y*, xj* | xS*)| : j ∈ {1, …, p}, S ⊆ {j}^C, |S| ≤ |A|, ρ(y*, xj* | xS*) ≠ 0} ≥ cn,

where cn = O(n^{−d}) and d ∈ (0, 1/2).

(B4) The partial correlations ρ(y*,xj*|xS*) and ρ(xj*,xk*|xS*) satisfy

  1. sup{|ρ(y*, xj* | xS*)| : j ∈ {1, …, p}, S ⊆ {j}^C, |S| ≤ |A|} ≤ τ < 1;

  2. sup{|ρ(xi*, xj* | xS*)| : i, j ∈ {1, …, p}, i ≠ j, S ⊆ {i, j}^C, |S| ≤ |A|} ≤ τ < 1.

(B5) Let f(u) be the density function of u ∈ U. Assume f is bounded away from 0, and that f, its first derivative f′ and second derivative f′′ are uniformly bounded on U. That is, there exist δ3 > 0 and M1 > 0 such that

inf_{u∈U} f(u) ≥ δ3,  sup_{u∈U} f(u) ≤ M1,  sup_{u∈U} |f′(u)| ≤ M1,  sup_{u∈U} |f′′(u)| ≤ M1.

(B6) The kernel function K is a symmetric density function, and it is bounded and has finite second moment, i.e.,

sup_u |K(u)| < ∞  and  ∫ u²K(u) du < ∞.

(B7) (v, x, y) has an elliptical distribution EC_{p+2}(µ, Σ, ϕ), and there exists a function ψ such that the σ-field generated by u = ψ(v) is the same as that generated by v.

(B8) The bandwidths satisfy hy → 0, hj → 0, nhy³ → ∞, and nhj³ → ∞ for all j ∈ {1, …, p}.

We further state some lemmas to facilitate the technical proofs of the theorems. The proofs of Lemmas 3–5 below follow the same techniques as in [12].

Lemma 3. We adopt the following notation for simplicity:

Z1(u,h) = (1/n) Σ_{i=1}^n Kh(ui − u),  Z2(u,h) = (1/n) Σ_{i=1}^n {(ui − u)/h}Kh(ui − u),  Z3(u,h) = (1/n) Σ_{i=1}^n {(ui − u)/h}²Kh(ui − u),
Z4(u,h) = (1/n) Σ_{i=1}^n xij Kh(ui − u),  Z5(u,h) = (1/n) Σ_{i=1}^n xij{(ui − u)/h}Kh(ui − u),
Z6(u,h) = (1/n) Σ_{i=1}^n yi Kh(ui − u),  Z7(u,h) = (1/n) Σ_{i=1}^n yi{(ui − u)/h}Kh(ui − u).

Then under conditions (B1), (B3)–(B6) and (B8), for some small s > 0, and any ϵ > 0, we have the following results:

  1. sup_{u∈U} Pr{|Z1(u,h) − f(u)| > ϵ} ≤ 4(1 − sϵ/4)^n,

  2. sup_{u∈U} Pr{|Z2(u,h) − f(u)∫tK(t)dt| > ϵ} ≤ 4(1 − sϵ/4)^n,

  3. sup_{u∈U} Pr{|Z3(u,h) − f(u)∫t²K(t)dt| > ϵ} ≤ 4(1 − sϵ/4)^n,

  4. sup_{u∈U} Pr{|Z4(u,h) − f(u)E(xj|u)| > ϵ} ≤ 4(1 − sϵ/4)^n,

  5. sup_{u∈U} Pr{|Z5(u,h) − f(u)E(xj|u)∫tK(t)dt| > ϵ} ≤ 4(1 − sϵ/4)^n,

  6. sup_{u∈U} Pr{|Z6(u,h) − f(u)E(y|u)| > ϵ} ≤ 4(1 − sϵ/4)^n,

  7. sup_{u∈U} Pr{|Z7(u,h) − f(u)E(y|u)∫tK(t)dt| > ϵ} ≤ 4(1 − sϵ/4)^n.

Lemma 4. Assume A(u) and B(u) are two uniformly bounded functions of u. That is, there exist M4 and M5 such that

sup_{u∈U} |A(u)| ≤ M4,  sup_{u∈U} |B(u)| ≤ M5.

For any given u, A^(u) and B^(u) are estimates of A(u) and B(u) based on n samples. Suppose there exist C1, …, C4 and q > 0 such that

sup_{u∈U} Pr{|A^(u) − A(u)| > ϵ} ≤ C1n{(1 − C2ϵ²/n^{4q})^n + exp(−C3n^q)},
sup_{u∈U} Pr{|B^(u) − B(u)| > ϵ} ≤ C1n{(1 − C2ϵ²/n^{4q})^n + exp(−C3n^q)}.

Then

sup_{u∈U} Pr{|A^(u)B^(u) − A(u)B(u)| > ϵ} ≤ C1n{(1 − C2ϵ²/n^{4q})^n + exp(−C3n^q)}.

Furthermore, if inf_{u∈U} |B(u)| ≥ δ3 > 0, then

sup_{u∈U} Pr{|A^(u)/B^(u) − A(u)/B(u)| > ϵ} ≤ C1n{(1 − C2ϵ²/n^{4q})^n + exp(−C3n^q)},
sup_{u∈U} Pr{|{B^(u)}^{1/2} − {B(u)}^{1/2}| > ϵ} ≤ C1n{(1 − C2ϵ²/n^{4q})^n + exp(−C3n^q)}.

Lemma 5. Define

Wj(u,h) = {Z3(u,h)Z4(u,h) − Z2(u,h)Z5(u,h)} / {Z1(u,h)Z3(u,h) − Z2(u,h)²},
V(u,h) = {Z3(u,h)Z6(u,h) − Z2(u,h)Z7(u,h)} / {Z1(u,h)Z3(u,h) − Z2(u,h)²}.

Then under the same conditions as Lemma 3, we have

  1. sup_{u∈U} Pr{|Wj(u,h) − E(xj|u)| > ϵ} ≤ 4(1 − sϵ/4)^n.

  2. sup_{u∈U} Pr{|V(u,h) − E(y|u)| > ϵ} ≤ 4(1 − sϵ/4)^n.

5.2. Proof of Theorem 1

We divide the proof into six steps.

Step 1: First note that

ρ^(y*, xj* | xk*) = {ρ^(y*, xj*) − ρ^(y*, xk*)ρ^(xj*, xk*)} / [{1 − ρ^2(y*, xk*)}{1 − ρ^2(xj*, xk*)}]^{1/2}

is a function of ρ^(y*, xj*), ρ^(y*, xk*), and ρ^(xj*, xk*). Let g(x, y, z) = (x − yz)/√{(1 − y²)(1 − z²)} for x, y, z ∈ (−1, 1); then all of its first and second partial derivatives are bounded, provided y and z are bounded away from ±1.

By Theorem 1 in [10],

Pr{|ρ^(y*, xj* | xk*) − ρ(y*, xj* | xk*)| > ϵ}
= Pr[|g{ρ^(y*, xj*), ρ^(y*, xk*), ρ^(xj*, xk*)} − g{ρ(y*, xj*), ρ(y*, xk*), ρ(xj*, xk*)}| > ϵ]
≤ Pr{‖(ρ^(y*, xj*), ρ^(y*, xk*), ρ^(xj*, xk*))^⊤ − (ρ(y*, xj*), ρ(y*, xk*), ρ(xj*, xk*))^⊤‖2 > Cϵ}
≤ Pr{|ρ^(y*, xj*) − ρ(y*, xj*)| > Cϵ/3} + Pr{|ρ^(y*, xk*) − ρ(y*, xk*)| > Cϵ/3} + Pr{|ρ^(xj*, xk*) − ρ(xj*, xk*)| > Cϵ/3}
≤ 3C1n{(1 − C2ϵ²/n^{4q})^n + exp(−C3n^q)}.

Similarly, for any S{j}C, and |S||A|, we can have

Pr{|ρ^(y*, xj* | xS*) − ρ(y*, xj* | xS*)| > ϵ} ≤ 3^{|S|}C1n{(1 − C2ϵ²/n^{4q})^n + exp(−C3n^q)}.

Step 2: If the distribution is assumed to be elliptical, we can use the sample version of the marginal kurtosis, viz.

κ^ = (1/p) Σ_{j=1}^p [{n^{−1} Σ_{i=1}^n (x˜ij − x˜¯j)⁴} / {3{n^{−1} Σ_{i=1}^n (x˜ij − x˜¯j)²}²} − 1],

to estimate the kurtosis. Similar to the proof in Step 1, we can obtain the following inequality:

Pr(|κ^ − κ| > ϵ) ≤ C4 exp(−C3 n^{1−4q} ϵ²) + C2 n exp(−C1 n^q).

Step 3: To study Pr{|Z^n(y*, xj* | xS*)/√(1 + κ^) − Zn(y*, xj* | xS*)/√(1 + κ)| > ϵ}, where the difference inside the absolute value is denoted by Δ below, define g2(u, v) = ln{(1 + u)/(1 − u)}/{2√(1 + v)} for all u ∈ (−1, 1) and v ∈ (−1, ∞). Then

Z^n(y*, xj* | xS*)/√(1 + κ^) = g2{ρ^(y*, xj* | xS*), κ^}  and  Zn(y*, xj* | xS*)/√(1 + κ) = g2{ρ(y*, xj* | xS*), κ}.

All the first and second derivatives are continuous and bounded for u ∈ (−τ, τ),v ∈ (−δ, +∞). By Theorem 1 in [10], under (B3) and (B5),

Pr(|Δ| > ϵ) ≤ Pr{‖(ρ^(y*, xj* | xS*), κ^)^⊤ − (ρ(y*, xj* | xS*), κ)^⊤‖2 > Cϵ}
≤ Pr{|ρ^(y*, xj* | xS*) − ρ(y*, xj* | xS*)| > Cϵ/2} + Pr(|κ^ − κ| > Cϵ/2)
≤ (3^{|S|} + 1)C1n{(1 − C2ϵ²/n^{4q})^n + exp(−C3n^q)}.

Step 4: Next we bound Pr(Ej|S). When testing the jth predictor given S ⊆ {j}^C, denote the event

Ej|S = {an error occurs when testing ρ(y*, xj* | xS*) = 0} = E^I_{j|S} ∪ E^II_{j|S},

where E^I_{j|S} denotes the Type I error event and E^II_{j|S} the Type II error event. We have

E^I_{j|S} = [(n − |S| − 1)^{1/2}|Z^n(y*, xj* | xS*)/√(1 + κ^)| > Φ^{−1}(1 − α/2) when Zn(y*, xj* | xS*) = 0].

Then

Pr(E^I_{j|S}) = Pr[(n − |S| − 1)^{1/2}|Z^n(y*, xj* | xS*)/√(1 + κ^)| > Φ^{−1}(1 − α/2) when Zn(y*, xj* | xS*) = 0]
≤ Pr{(n − |S| − 1)^{1/2}|Δ| > Φ^{−1}(1 − α/2)}
= Pr[|Δ| > √{n/(n − |S| − 1)} cn/{2√(1 + κ)}]
≤ Pr[|Δ| > cn/{2√(1 + κ)}]
≤ 3^{|S|}C1n{(1 − C2cn²/n^{4q})^n + exp(−C3n^q)},

where the equality in the third line follows by choosing α = 2[1 − Φ{√n cn/(2√(1 + κ))}]. Furthermore,

E^II_{j|S} = [(n − |S| − 1)^{1/2}|Z^n(y*, xj* | xS*)/√(1 + κ^)| ≤ Φ^{−1}(1 − α/2) when Zn(y*, xj* | xS*) ≠ 0].

With the same choice α = 2[1 − Φ{√n cn/(2√(1 + κ))}], we obtain the following inequality:

Pr(E^II_{j|S}) = Pr[(n − |S| − 1)^{1/2}|Z^n(y*, xj* | xS*)/√(1 + κ^)| ≤ Φ^{−1}(1 − α/2) when Zn(y*, xj* | xS*) ≠ 0]
= Pr[|Z^n(y*, xj* | xS*)/√(1 + κ^)| ≤ √{n/(n − |S| − 1)} cn/{2√(1 + κ)} when Zn(y*, xj* | xS*) ≠ 0]
≤ Pr[|Zn(y*, xj* | xS*)/√(1 + κ)| − |Δ| ≤ √{n/(n − |S| − 1)} cn/{2√(1 + κ)}].

Let g3(u) = (1/2) × ln{(1 + u)/(1 − u)}; then |g3(u)| ≥ |u| for all u ∈ (−1, 1). According to (B3), |ρ(y*, xj* | xS*)| ≥ cn whenever this partial correlation is nonzero, so |Zn(y*, xj* | xS*)/√(1 + κ)| ≥ cn/√(1 + κ). Then

Pr(E^II_{j|S}) ≤ Pr[|Δ| ≥ cn/√(1 + κ) − √{n/(n − |S| − 1)} cn/{2√(1 + κ)}]
= Pr[|Δ| ≥ {cn/√(1 + κ)}{1 − √{n/(n − |S| − 1)}/2}]
≤ Pr[|Δ| ≥ 3cn/{8√(1 + κ)}]
≤ 3^{|S|}C1n{(1 − C2cn²/n^{4q})^n + exp(−C3n^q)},

since n/(n − |S| − 1) ≤ 5/4 for large n. Combining the above results, we have

Pr(Ej|S) ≤ Pr(E^I_{j|S}) + Pr(E^II_{j|S}) ≤ 3^{|S|}C1n{(1 − C2cn²/n^{4q})^n + exp(−C3n^q)}.

Step 5: To bound Pr(A^[m] ≠ A), we consider all j ∈ {1, …, p} and all S ⊆ {j}^C with |S| ≤ |A|. Under the partial faithfulness assumption and (B2), define, for each j ∈ {1, …, p}, Kj = {S ⊆ {j}^C : |S| ≤ |A|}. Now

Pr(A^[m] ≠ A) = Pr{an error occurs for some j and some S}
= Pr(∪_{j∈{1,…,p}, S∈Kj} Ej|S)
≤ Σ_{j∈{1,…,p}, S∈Kj} Pr(Ej|S)
≤ p × p^{|A|} sup_{j∈{1,…,p}, S∈Kj} Pr(Ej|S)
≤ 3^{|A|} p^{|A|+1} C1n{(1 − C2cn²/n^{4q})^n + exp(−C3n^q)}.

The second-to-last inequality holds since there are p possible choices of j and at most p^{|A|} possible choices for S. Furthermore, similarly to Lemma 3 in [1], we can show that Pr(m = |A|) → 1.

Step 6: Under (B2), for p = O{exp(n^a)} ≤ C4 exp(n^a) and |A| = O(n^b) ≤ C5 n^b, we have

Pr(A^[m] ≠ A) ≤ 3^{|A|} p^{|A|+1} C1n{(1 − C2cn²/n^{4q})^n + exp(−C3n^q)}.

Therefore,

ln{Pr(A^[m] ≠ A)} ≤ |A| ln 3 + (|A| + 1) ln(p) + ln(C1n) + ln{(1 − C2cn²/n^{4q})^n + exp(−C3n^q)}
≤ C1(|A| + 1) ln(p) + ln{(1 − C2cn²/n^{4q})^n + exp(−C3n^q)}
≤ C1(C5n^b + 1)(ln C4 + n^a) + ln{(1 − C2cn²/n^{4q})^n + exp(−C3n^q)}.

Note that e^{−x} ≤ 1 − x/2 for x ∈ [0, ln 2]. Since q ∈ (0, 1), C3/n^{1−q} ∈ [0, ln 2] for n large enough, so

exp(−C3n^q) = {exp(−C3/n^{1−q})}^n ≤ {1 − C3/(2n^{1−q})}^n.

Therefore, since cn is of order n^{−d} (so that C2cn²/n^{4q} can be written as C2/n^{2d+4q} after adjusting the constant),

ln{Pr(A^[m] ≠ A)} ≤ C1(C5n^b + 1)(ln C4 + n^a) + ln[{1 − C2/n^{2d+4q}}^n + {1 − C3/(2n^{1−q})}^n].

Since 2d + 5(a + b) < 1, we have (1 − 2d)/5 < (1 − a − b − 2d)/4, so q can be chosen with n^{4q} < n^{1−a−b−2d}. Then

n^{a+b}/[n ln{(1 − C2/n^{2d+4q})}] ≈ −n^{a+b}/{nC2/n^{2d+4q}}{1 + o(1)} ~ −n^{4q}/{C2 n^{1−a−b−2d}} → 0

and

n ln{(1 − C2/n^{2d+4q})} ≈ −nC2/n^{2d+4q}{1 + o(1)} = −C2 n^{1−2d−4q}{1 + o(1)} → −∞.

Thus,

ln{Pr(A^[m] ≠ A)} ≤ C1n^{a+b} + n ln(1 − C2/n^{2d+4q}) = n ln{(1 − C2/n^{2d+4q})}[C1n^{a+b}/{n ln(1 − C2/n^{2d+4q})} + 1] → −∞,

since the factor in brackets tends to 1 while n ln(1 − C2/n^{2d+4q}) → −∞. As a result, Pr(A^[m] ≠ A) → 0. This completes the proof of Theorem 1. □

5.3. Proof of Theorem 2

Since we apply least squares on the estimated active set, ‖β^ − β‖ = Op(n^{−1/2}) when A^[m] = A. According to Theorem 1, Pr(A^[m] = A) → 1; thus ‖β^ − β‖ = Op(n^{−1/2}).

We next focus on the asymptotic normality of the nonparametric estimation of g(u). We complete the proof in three steps.

Step 1: First we derive the bias of g^(u). After obtaining the estimated active set, for any fixed u, we have

b^(u) ≡ (b^0(u), b^1(u))^⊤ = argmin_{b0, b1} Σ_{i=1}^n {yi − xi^⊤β^ − b0 − b1(ui − u)}² Kh(ui − u) = (Zu^⊤WuZu)^{−1}Zu^⊤Wu(y − Xβ^),

where

Zu = ( 1  u1 − u
       ⋮   ⋮
       1  un − u )  and  Wu = diag{Kh(u1 − u), …, Kh(un − u)}.

We note the following facts:

  1. E(y − Xβ | X, u) = {g(u1), …, g(un)}^⊤.

  2. (Zu^⊤WuZu)^{−1}Zu^⊤WuX = Op(1), and β^ − β = Op(n^{−1/2}).

  3. g(ui) = b0 + b1(ui − u) + b2(ui − u)² + ···, where b0 = g(u), b1 = g′(u) and b2 = g′′(u)/2.

Now

E(b^|X, u) = (Zu^⊤WuZu)^{−1}Zu^⊤Wu E(y − Xβ^|X, u)
= (Zu^⊤WuZu)^{−1}Zu^⊤Wu E{y − Xβ + X(β − β^)|X, u}
= (Zu^⊤WuZu)^{−1}Zu^⊤Wu [{g(u1), …, g(un)}^⊤ + XE(β − β^|X, u)]
= (Zu^⊤WuZu)^{−1}Zu^⊤Wu {b0 + b1(ui − u) + b2(ui − u)² + ···}_{i=1}^n + Op(n^{−1/2})
= (b0, b1)^⊤ + (Zu^⊤WuZu)^{−1}Zu^⊤Wu {b2(ui − u)² + ···}_{i=1}^n + Op(n^{−1/2})
= (b0, b1)^⊤ + J1 + Op(n^{−1/2}).

Set

Sn = Zu^⊤WuZu = ( Σ_{i=1}^n Kh(ui − u)        Σ_{i=1}^n (ui − u)Kh(ui − u)
                  Σ_{i=1}^n (ui − u)Kh(ui − u)  Σ_{i=1}^n (ui − u)²Kh(ui − u) ),  H = ( 1  0
                                                                                        0  h ).

Then

(1/n)H^{−1}SnH^{−1} = ( (1/n)Σ Kh(ui − u)           {1/(nh)}Σ(ui − u)Kh(ui − u)
                        {1/(nh)}Σ(ui − u)Kh(ui − u)  {1/(nh²)}Σ(ui − u)²Kh(ui − u) ) = f(u)(1 0; 0 μ2) + Op{h + 1/(nh)}.

Next,

Zu^⊤Wu{(u1 − u)^l, …, (un − u)^l}^⊤ = ( nh^l (1/n)Σ{(ui − u)/h}^l Kh(ui − u) ; nh^{l+1} (1/n)Σ{(ui − u)/h}^{l+1} Kh(ui − u) ).

Then J1 can be expressed as follows:

J1 = (Zu^⊤WuZu)^{−1}Zu^⊤Wu{b2(u1 − u)², …, b2(un − u)²}^⊤ + o(h²)
= b2 Sn^{−1} ( nh² (1/n)Σ{(ui − u)/h}²Kh(ui − u) ; nh³ (1/n)Σ{(ui − u)/h}³Kh(ui − u) ) + o(h²)
= b2h² H^{−1}{n^{−1}H^{−1}SnH^{−1}}^{−1} ( (1/n)Σ{(ui − u)/h}²Kh(ui − u) ; (1/n)Σ{(ui − u)/h}³Kh(ui − u) ) + o(h²)
= b2h² H^{−1}[f^{−1}(u)(1 0; 0 1/μ2) + Op{h + 1/(nh)}][f(u)(μ2 ; μ3) + Op{h + 1/(nh)}] + o(h²)
= b2h² H^{−1}[(μ2 ; μ3/μ2) + Op{h + 1/(nh)}] + o(h²).

Thus,

E{g^(u)|X, u} = (1, 0)E(b^|X, u) = b0 + b2h²[μ2 + Op{h + 1/(nh)}] + o(h²) + Op(n^{−1/2}) = b0 + b2μ2h² + o(h²) + Op(n^{−1/2}),

and the last equality holds because nh³ → ∞ as n → ∞. Note that the leading terms on the right-hand side do not depend on X.

Therefore,

E{g^(u)} = b0 + b2μ2h² + o(h²) + Op(n^{−1/2}) = g(u) + g′′(u)μ2h²/2 + o(h²) + Op(n^{−1/2}).

Step 2: Derive the variance of g^(u). Note that var(b^|X, u) = (Zu^⊤WuZu)^{−1}Zu^⊤Wu var(y − Xβ^|X, u)WuZu(Zu^⊤WuZu)^{−1}, and

var(y − Xβ^|X, u) = var{y − Xβ + X(β − β^)|X, u} = var(y − Xβ|X, u) + var{X(β − β^)|X, u} + 2cov{y − Xβ, X(β − β^)|X, u} = {σ² + Op(n^{−1/2})}In.

Then

var(b^|X, u) = (Zu^⊤WuZu)^{−1}Zu^⊤Wu{σ²In + Op(n^{−1/2})}WuZu(Zu^⊤WuZu)^{−1}
= {σ² + Op(n^{−1/2})}Sn^{−1}(Zu^⊤Wu²Zu)Sn^{−1}
= {σ² + Op(n^{−1/2})}H^{−1}(H^{−1}SnH^{−1})^{−1}{H^{−1}(Zu^⊤Wu²Zu)H^{−1}}(H^{−1}SnH^{−1})^{−1}H^{−1}
≡ {σ² + Op(n^{−1/2})}J2.

Furthermore,

H^{−1}(Zu^⊤Wu²Zu)H^{−1} = ( Σ Kh²(ui − u)              Σ{(ui − u)/h}Kh²(ui − u)
                            Σ{(ui − u)/h}Kh²(ui − u)   Σ{(ui − u)/h}²Kh²(ui − u) ) = (n/h)[f(u)(ν0 ν1; ν1 ν2) + Op{h + 1/(nh)}]

and

J2 = (1/n)H^{−1}(n^{−1}H^{−1}SnH^{−1})^{−1}{n^{−1}H^{−1}(Zu^⊤Wu²Zu)H^{−1}}(n^{−1}H^{−1}SnH^{−1})^{−1}H^{−1}
= (1/n)H^{−1}[f^{−1}(u)(1 0; 0 1/μ2) + Op{h + 1/(nh)}] h^{−1}[f(u)(ν0 ν1; ν1 ν2) + Op{h + 1/(nh)}] [f^{−1}(u)(1 0; 0 1/μ2) + Op{h + 1/(nh)}]H^{−1}
= {1/(nh)}H^{−1}[f^{−1}(u)(ν0 ν1/μ2; ν1/μ2 ν2/μ2²) + Op{h + 1/(nh)}]H^{−1}.

Therefore,

var{g^(u)|X, u} = (1, 0)var(b^|X, u)(1, 0)^⊤ = [{σ² + Op(n^{−1/2})}/(nh)][ν0/f(u) + Op{h + 1/(nh)}] = [σ²ν0 + Op(n^{−1/2}) + Op{h + 1/(nh)}]/{nhf(u)}.

Notice that the leading terms on the right-hand side do not depend on X. Therefore,

var{g^(u)} = [σ²ν0 + Op(n^{−1/2}) + Op{h + 1/(nh)}]/{nhf(u)}.

Step 3: In order to derive the asymptotic distribution of g^(u), we can use the following facts:

  1. g^(u) = (1, 0)b^ = (1, 0)(Zu^⊤WuZu)^{−1}Zu^⊤Wu(y − Xβ^).

  2. E{g^(u)|X, u} = (1, 0)E(b^|X, u) = g(u) + g′′(u)μ2h²/2 + Op(n^{−1/2}) + o(h²).

  3. E{g^(u)|X, u} = (1, 0)(Zu^⊤WuZu)^{−1}Zu^⊤Wu E(y − Xβ^|X, u).

  4. y − Xβ − E(y − Xβ|X, u) = (ϵ1, …, ϵn)^⊤.

We first study g^(u) − g(u) − g′′(u)μ2h²/2:

g^(u) − g(u) − g′′(u)μ2h²/2 = g^(u) − E{g^(u)|X, u} + Op(n^{−1/2}) + o(h²)
= (1, 0)(Zu^⊤WuZu)^{−1}Zu^⊤Wu{y − Xβ^ − E(y − Xβ^|X, u)} + Op(n^{−1/2}) + o(h²)
= (1, 0)Sn^{−1}Zu^⊤Wu[y − Xβ − E(y − Xβ|X, u) + X(β − β^) − E{X(β − β^)|X, u}] + Op(n^{−1/2}) + o(h²).

Because Sn^{−1}Zu^⊤WuX = Op(1) and β^ − β = Op(n^{−1/2}), we obtain

g^(u) − g(u) − g′′(u)μ2h²/2 = (1, 0)Sn^{−1}Zu^⊤Wu{y − Xβ − E(y − Xβ|X, u)} + Op(n^{−1/2}) + o(h²)
= (1, 0)Sn^{−1}Zu^⊤Wu(ϵ1, …, ϵn)^⊤ + Op(n^{−1/2}) + o(h²)
= (1, 0)H^{−1}(nHSn^{−1}H){n^{−1}H^{−1}Zu^⊤Wu(ϵ1, …, ϵn)^⊤} + Op(n^{−1/2}) + o(h²).

Furthermore, given that

n^{−1}H^{−1}SnH^{−1} →P f(u)(1 0; 0 μ2),

we have

nHSn^{−1}H →P {1/f(u)}(1 0; 0 1/μ2).

Moreover,

H^{−1}{n^{−1}Zu^⊤Wu(ϵ1, …, ϵn)^⊤} = (1 0; 0 1/h)( (1/n)Σ Kh(ui − u)ϵi ; (1/n)Σ(ui − u)Kh(ui − u)ϵi ) = ( (1/n)Σ Kh(ui − u)ϵi ; (1/n)Σ{(ui − u)/h}Kh(ui − u)ϵi ).

We now turn to (1/n)Σ_{i=1}^n Kh(ui − u)ϵi and find

ξn² = var{(1/n)Σ_{i=1}^n Kh(ui − u)ϵi} = (1/n)E{Kh²(ui − u)ϵi²} = (σ²/n)E{Kh²(ui − u)}.

Note that

E{Kh²(ui − u)} = ∫{h^{−1}K((ui − u)/h)}² f(ui) dui = h^{−1}∫K²(t)f(u + th) dt = h^{−1}∫K²(t){f(u) + thf′(u) + o(h)} dt = h^{−1}{f(u)ν0 + hf′(u)ν1 + o(h)}.

Thus ξn² = σ²{f(u)ν0 + hf′(u)ν1 + o(h)}/(nh) ≡ {C1 + o(h)}/(nh). Similarly, we have

Σ_{i=1}^n E|Kh(ui − u)ϵi/n|³ = n^{−2}E{Kh³(ui − u)} × E|ϵi|³ = n^{−2}h^{−2}{C2 + o(h)}.

Therefore, as n → ∞ with nh³ → ∞ (so that nh → ∞),

(1/ξn³)Σ_{i=1}^n E|Kh(ui − u)ϵi/n|³ = [{C2 + o(h)}/(n²h²)] / [{C1 + o(h)}/(nh)]^{3/2} = {C3 + o(h)}/√(nh) → 0.

By the Lyapunov Central Limit Theorem, we get

[(1/n)Σ_{i=1}^n Kh(ui − u)ϵi] / √[σ²{f(u)ν0 + hf′(u)ν1 + o(h)}/(nh)] ⇝ N(0, 1).

That is,

√(h/n) Σ_{i=1}^n Kh(ui − u)ϵi ⇝ N[0, σ²f(u)ν0].

Similarly,

√(h/n) Σ_{i=1}^n {(ui − u)/h}Kh(ui − u)ϵi ⇝ N[0, σ²f(u)ν2].

Applying Slutsky’s Lemma, we find

√(nh){g^(u) − g(u) − g′′(u)μ2h²/2}
= √(nh)(1, 0)H^{−1}(nHSn^{−1}H){n^{−1}H^{−1}Zu^⊤Wu(ϵ1, …, ϵn)^⊤} + Op(√h) + o(√(nh⁵))
= (1, 0)(nHSn^{−1}H)( √(h/n)Σ Kh(ui − u)ϵi ; √(h/n)Σ{(ui − u)/h}Kh(ui − u)ϵi ) + Op(√h) + o(√(nh⁵))
⇝ {1/f(u)} N[0, σ²f(u)ν0] = N[0, σ²ν0/f(u)],

which completes the proof of Theorem 2. □

6. Conclusion

In this paper, we proposed a new approach to selecting significant variables in partially linear models via partial correlation learning. Under the partial faithfulness framework, nonparametric smoothing techniques are adopted to obtain the partial residuals, and recursive hypothesis tests of the partial correlations between partial residuals are conducted to select the linear covariates in a backward direction. Model selection consistency is proved and empirically verified through simulations. Furthermore, the √n-consistency of the estimated linear coefficients and the asymptotic normality of the nonparametric baseline estimator are established. The performance of the method is further illustrated by the supermarket data analysis.


Acknowledgments

The authors are grateful to the Editor-in-Chief, the Associate Editor and the referees for comments and suggestions that led to significant improvements. Liu’s research was supported by National Natural Science Foundation of China (NNSFC) grants 11771361 and 11671334, JAS14007 and the Fundamental Research Funds for the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry. Li’s research was supported by National Institute on Drug Abuse (NIDA) grants P50 DA039838 and P50 DA036107, and National Science Foundation grants DMS 1512422 and DMS 1820702. This work was also partially supported by NNSFC grant 11690015. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NSF, the NIDA, the NIH, or the NNSFC.

Footnotes

Online Supplement

The Online Supplement includes two parts. The first half provides the proof of Lemma 2 and the second half reports the additional simulation results mentioned in Section 4.

References

  • [1] Bühlmann P., Kalisch M., Maathuis M., Variable selection in high-dimensional linear models: Partially faithful distributions and the PC-simple algorithm, Biometrika 97 (2010) 261–278.
  • [2] Chen J., Chen Z., Extended Bayesian information criteria for model selection with large model spaces, Biometrika 95 (2008) 759–771.
  • [3] Fan J., Gijbels I., Local Polynomial Modelling and Its Applications, Chapman & Hall, London, 1996.
  • [4] Fan J., Huang T., Profile likelihood inferences on semiparametric varying-coefficient partially linear models, Bernoulli 11 (2005) 1031–1057.
  • [5] Fan J., Li R., Variable selection via nonconcave penalized likelihood and its oracle properties, J. Amer. Statist. Assoc. 96 (2001) 1348–1360.
  • [6] Fan J., Li R., New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis, J. Amer. Statist. Assoc. 99 (2004) 710–723.
  • [7] Fan J., Lv J., Sure independence screening for ultrahigh dimensional feature space (with discussion), J. R. Stat. Soc. Ser. B 70 (2008) 849–911.
  • [8] Heckman N., Spline smoothing in a partly linear model, J. R. Stat. Soc. Ser. B 48 (1986) 244–248.
  • [9] Li R., Liang H., Variable selection in semiparametric regression modeling, Ann. Statist. 36 (2008) 261–286.
  • [10] Li R., Liu J., Lou L., Variable selection via partial correlation, Statist. Sinica 27 (2017) 983–996.
  • [11] Liang H., Li R., Variable selection for partially linear models with measurement errors, J. Amer. Statist. Assoc. 104 (2009) 234–248.
  • [12] Liu J., Li R., Wu R., Feature selection for varying coefficient models with ultrahigh dimensional covariates, J. Amer. Statist. Assoc. 109 (2014) 266–274.
  • [13] Ruppert D., Sheather S.J., Wand M.P., An effective bandwidth selector for local least squares regression, J. Amer. Statist. Assoc. 90 (1995) 1257–1270.
  • [14] Ruppert D., Wand M.P., Carroll R.J., Semiparametric Regression, Cambridge University Press, New York, 2003.
  • [15] Speckman P., Kernel smoothing in partial linear models, J. R. Stat. Soc. Ser. B 50 (1988) 413–436.
  • [16] Tibshirani R.J., Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B 58 (1996) 267–288.
  • [17] Wang H., Xia Y., Shrinkage estimation of the varying coefficient model, J. Amer. Statist. Assoc. 104 (2009) 747–757.
  • [18] Xie H., Huang J., SCAD-penalized regression in high-dimensional partially linear models, Ann. Statist. 37 (2009) 673–696.
