Abstract
Most existing methods of variable selection in partially linear models (PLM) with ultrahigh dimensional covariates are based on partial residuals, which involve a two-step estimation procedure. The estimation error produced in the first step may have an impact on the second step, and multicollinearity among predictors adds further challenges to the model selection procedure. In this paper, we propose a new Bayesian variable selection approach for PLM. This new proposal addresses those two issues simultaneously in that (1) it is a one-step method which selects variables in PLM even when the dimension of covariates increases at an exponential rate with the sample size, and (2) the method retains model selection consistency and outperforms existing ones in the setting of highly correlated predictors. Unlike existing procedures, ours employs the difference-based method to reduce the impact of estimating the nonparametric component, and incorporates Bayesian subset modeling with diffusing prior (BSM-DP) to shrink the corresponding estimator in the linear component. The estimation is implemented by Gibbs sampling, and we prove that the posterior probability of the true model being selected converges to one asymptotically. Simulation studies support the theory and demonstrate the efficiency of our method relative to existing ones, and we conclude with an application to a study of supermarket data.
Keywords: Bayesian variable selection; Difference-based method; Selection consistency; Semiparametric modeling
2010 MSC: Primary 62G08; Secondary 62J05
1. Introduction
Semiparametric models attract considerable attention in the literature since they retain the interpretability of parametric models while keeping some of the flexibility of nonparametric models. In this paper, we study a commonly used semiparametric model, the partially linear model (PLM). The PLM assumes that the response Y depends linearly on some covariates of interest, and nonparametrically on another univariate continuous covariate U defined on [0, 1]. Suppose that the observed data {(Yi, Xi, Ui)}, i ∈ {1, …, n}, is a random sample from the following PLM:
Yi = Xi⊤β + f(Ui) + ϵi,  i ∈ {1, …, n},   (1)

where β is a p-dimensional coefficient vector, f(·) is an unknown smooth function, and ϵi is a random error.
This PLM specifies a parsimonious linear function in the parametric part, while allowing the nonparametric component to remain unconstrained and subject to empirical estimation. In this paper, a new one-step Bayesian approach is proposed to select variables for the PLM with ultrahigh-dimensional covariate X, that is, ln p = o(n). Specifically, our proposed method simplifies the procedure by avoiding estimation of the infinite-dimensional nonparametric component, and yields sparsity in the linear component.
The estimation procedure for the PLM with fixed dimension p of X has been extensively studied. Engle et al. used the penalized least squares method to estimate β and the nuisance function f(·) simultaneously by adding a penalty on the roughness of f(·), which is referred to as partial smoothing splines [4, 7, 12, 23]. Since β is of primary interest, some other methods cleverly avoid the estimation of f(·). For example, Robinson [24] introduced a profile least squares estimator based on the idea of partial residuals, which later became one of the most commonly used approaches to eliminate the nonparametric component in the PLM. Another type of approach to eliminating the nonparametric component is the difference-based method [27, 29], which estimates the coefficients of the linear component by taking differences of the ordered observations. The resulting estimator is proven to be asymptotically efficient under finite dimension. See Section 2.1 for more details about the difference-based method.
Variable selection for the PLM can be accomplished by adding another penalty function on β to the loss function of the aforementioned partial smoothing splines method. The least absolute shrinkage and selection operator (LASSO) [25], the nonnegative garrote [2, 31], the smoothly clipped absolute deviation (SCAD) [8], the elastic net [34], and the minimax concave penalty (MCP) [32] are all popular choices of penalty functions. Xie and Huang [28] used the SCAD penalty to achieve sparsity in the linear part while simultaneously estimating the nonparametric component by polynomial splines; the resulting estimator was shown to be consistent when p grows with n at a suitable rate. An alternative is a two-step procedure, in which Y and X are first regressed on U separately to obtain partial residuals, and variable selection is then applied to the transformed model. For example, consistency of both the linear and the nonparametric components was studied by Zhu et al. [33] in a high-dimensional regime, and Liang and Li [19] discussed this approach in the presence of measurement errors. More recently, Liu et al. [20] proposed a selection procedure that recursively tests the partial correlations among the partial residuals and among the covariates when ln p = o(n); the method is referred to as thresholded partial correlation on partial residuals (TPC-PR). However, to the best of our knowledge, there is little literature on variable selection in the high-dimensional setting based on extensions of the difference-based method.
The Bayesian approach places priors on the parameters and the model space, and selects the model with the highest posterior probability. There have been multiple developments in Bayesian variable selection for linear and generalized linear models. George and McCulloch [11] proposed a milestone method of Bayesian variable selection via stochastic search: they introduced a latent binary vector to indicate the inclusion of variables in linear models, and placed a mixture spike and slab prior on each coefficient conditional on this latent vector. Following this approach, many other selection procedures with similar structure have been proposed, differing mostly in the form of the spike and slab priors or in the prior on the model space. To alleviate the difficulty of choosing specific prior parameters, several approaches have been proposed; see [10, 14, 30]. However, these papers focused on small-scale problems and did not discuss possible extensions to the high-dimensional setting. More recently, Ishwaran and Rao [15] established the oracle property of the posterior mean as n goes to infinity with fixed p, under certain conditions on the prior variances for linear models. Johnson and Rossell [16] proved selection consistency under p = O(n) for a non-local prior in linear model settings. Liang et al. [18] proposed a point-mass spike prior with a slab prior depending on the model size, and proved posterior consistency under ln p = o(n) in generalized linear models, but the corresponding conditions are relatively strong, and the step-wise estimation procedure is not efficient. Narisetty and He [21] also used Gaussian priors but argued that they should be sample-size dependent, referred to as Bayesian shrinking and diffusing priors (BASAD), and obtained strong selection consistency when ln p = o(n) for linear models under mild assumptions.
However, BASAD is not computationally practical for large-p problems, since it requires updating β from a p-dimensional multivariate normal distribution in each iteration. Recently, Narisetty et al. [22] proposed the Skinny Gibbs (SG) algorithm to address this computational issue by sparsifying the precision matrix, and argued that SG is scalable: the required computation time grows approximately linearly in p. Selection consistency was proved for logistic regression. While spike and slab priors have been widely used in applications for their attractive interpretability, the theory for spike and slab models has not caught up with the applications. Again, all the aforementioned papers focus on linear or generalized linear models, and the corresponding work on semiparametric or nonparametric models in the high-dimensional setting is limited.
In this paper, we propose a Bayesian subset selection procedure for the partially linear model. We incorporate the difference-based method in the prior for the nonparametric component. For the parametric component, we adopt a modified version of the Bayesian shrinking and diffusing priors (BASAD) [21] and propose the novel Bayesian subset modeling with diffusing prior (BSM-DP). We use a normal distribution with a diverging variance as the slab prior and a normal distribution with a small variance as the spike prior. Unlike BASAD, the response variable in our model depends only on the active covariates, which conveniently allows us to sample coefficients separately for the active and the inactive sets during estimation. In fact, the spike prior has no impact on the theoretical result, so any proposal including a point mass will work. As a practical note, we recommend a Gaussian distribution with a small variance, which allows the Markov chain more flexibility to explore the model space and hence avoids local traps. As a result, the proposed method is more computationally efficient than BASAD. We also note that Skinny Gibbs (SG) [22] is a special case of BSM-DP in which the variance of the spike prior is set proportional to 1/n; the original paper [22] discussed logistic regression only. We establish selection consistency for the parametric component in partially linear models when ln p = o(n) under mild conditions.
The rest of the paper is organized as follows. In Section 2 we present the Bayesian subset modeling with diffusing prior (BSM-DP) and discuss variable selection for the partially linear model, followed by the estimation procedure, regularity conditions, and theoretical results. Numerical studies are presented in Section 3 to demonstrate the reliability of the proposed model, and we further apply the proposed method to the supermarket data set. Proofs of lemmas and theorems are given in Section 4, followed by a discussion in Section 5.
2. Bayesian subset modeling with diffusing prior
2.1. Model and notation
Suppose that {(Yi, Xi, Ui)}, i ∈ {1, …, n}, is a random sample from PLM (1) with high-dimensional covariate Xi ∈ ℝ^{pn} and univariate covariate Ui ∈ [0, 1], where we use pn to emphasize that the number of variables is allowed to diverge with the sample size n. Assume that the random error ϵ is independent of (X⊤, U), and that each observation Xi has the same distribution with mean 0 and covariance Σ. Denote f(Ui) by αi, and let α = (α1, …, αn)⊤ be a vector of size n. Then Y is the corresponding vector of n responses, and X is the design matrix of size n × pn.
We will propose a prior for the nonparametric function (i.e., the vector α) in our proposed Bayesian subset selection based on the difference-based method. Assume the observations {(Yi, Xi, Ui)}1≤i≤n are ordered by increasing values of {Ui}1≤i≤n. The difference between adjacent observed responses can be written as

Yi − Yi−1 = (Xi − Xi−1)⊤β + {f(Ui) − f(Ui−1)} + (ϵi − ϵi−1).
If Ui−1 and Ui are close and f(·) is smooth enough, f(Ui) should also be close to f(Ui−1), so the nonparametric part tends to cancel out. In this case, the ordinary least squares estimate can be applied to the differenced data, as long as X is not perfectly correlated with U. Define the mth-order difference sequence {di}i∈{1,…,m+1} to satisfy Σ_{i=1}^{m+1} di = 0 and Σ_{i=1}^{m+1} di² = 1. The mth-order difference operation then reduces the sample size to n − m by defining

Ỹi = C^{1/2} Σ_{j=1}^{m+1} dj Y_{i+j−1},

for i ∈ {1, …, n − m}, where C is some positive constant. Define the difference matrix D as
D =
⎛ d1  d2  ⋯  dm+1  0    ⋯   0   ⎞
⎜ 0   d1  d2  ⋯   dm+1  ⋯   0   ⎟
⎜ ⋮        ⋱   ⋱         ⋱   ⋮   ⎟
⎝ 0   ⋯   0   d1   d2   ⋯  dm+1 ⎠,   (2)

an (n − m) × n banded matrix whose ith row contains the difference sequence in columns i through i + m.
Therefore the PLM (1) can be rewritten in matrix form as

Ỹ = X̃β + f̃ + ϵ̃,

where Ỹ = C^{1/2}DY, X̃ = C^{1/2}DX, f̃ = C^{1/2}Dα, and ϵ̃ = C^{1/2}Dϵ. Under some smoothness conditions on f(·) with fixed p, Yatchew [29] and Wang et al. [27] showed that the ordinary least squares estimator is asymptotically efficient as m → ∞, if X and U are independent. This indicates that the effect of the nonparametric component is negligible after applying a high-order difference operation to the data.
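As an illustration of the differencing idea (a minimal numpy sketch under assumed settings, not the authors' code), the following applies a first-order (m = 1) difference with a centered, normalized sequence and runs ordinary least squares on the differenced data; the smooth term f(U) nearly cancels, so β is recovered without estimating f.

```python
import numpy as np

def difference_matrix(n, m, d):
    """(n - m) x n banded difference matrix: row i holds the
    difference sequence d (length m + 1) in columns i..i+m."""
    D = np.zeros((n - m, n))
    for i in range(n - m):
        D[i, i:i + m + 1] = d
    return D

rng = np.random.default_rng(0)
n, p, m = 200, 5, 1
# First-order difference sequence with sum d_j = 0 and sum d_j^2 = 1.
d = np.array([1.0, -1.0]) / np.sqrt(2.0)

U = np.sort(rng.uniform(size=n))
X = rng.normal(size=(n, p))
beta = np.array([2.0, 0.0, -1.5, 0.0, 1.0])   # illustrative coefficients
f = np.sin(2 * np.pi * U)                     # smooth nuisance function
Y = X @ beta + f + 0.5 * rng.normal(size=n)

D = difference_matrix(n, m, d)
# After differencing, f(U_i) - f(U_{i-1}) is nearly zero, so OLS on
# the differenced data estimates beta without estimating f.
beta_hat, *_ = np.linalg.lstsq(D @ X, D @ Y, rcond=None)
```

The recovered `beta_hat` is close to the true `beta` even though f(·) was never estimated.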
In the literature, Ui’s are either from a fixed design, e.g., Ui = i/n, or observations from a distribution on [0, 1] with density function bounded away from 0. In this paper, we only consider the case where X and U are independent under a dense design with bounded max_{2≤i≤n} |Ui − Ui−1|.
We use Xj as the notation for the jth covariate. A latent binary random vector γ of size pn is introduced, whose jth entry γj indicates whether Xj is included in the model (1 = present, 0 = not present). Therefore, the model space is fully specified by γ, and we use γ to refer to a model interchangeably. The true model is denoted by γ₀. The cardinality of a model γ, denoted by |γ|, is the size of the model. Consequently, β_γ is the subvector of β of size |γ|, X_γ is the submatrix of X with respect to model γ, and Σ_γ is the covariance matrix for X_γ. Other notations used in the paper are unified as follows.
Model operations: γ ∩ γ′ and γ ∪ γ′ are defined as the intersection and union of models γ and γ′; for example, X_{γ∪γ′} is the submatrix of X corresponding to model γ ∪ γ′.
Rates: an ⪯ bn or bn ⪰ an means an = O(bn); an ≺ bn or bn ≻ an means an = o(bn); and an ~ bn means an/bn → c for some positive constant c.
Matrices and matrix operations: the n × n identity matrix is denoted by In. For a matrix M, ∥M∥ is the spectral norm, i.e., the largest singular value of M. The Moore–Penrose inverse of M is denoted by M⁺; it is the unique generalized inverse. If M is a positive definite matrix, we use λmin(M) and λmax(M) to denote its minimum and maximum eigenvalues.
Here and hereafter, the densities are conditional on X and U. The working model for variable selection in the partially linear model (1) via Bayesian subset modeling with diffusing prior (BSM-DP) is proposed as

Y | γ, β, α, σ² ∼ N(X_γ β_γ + α, σ² In),
βj | γj = 1, σ² ∼ N(0, σ² σ1n²),   βj | γj = 0, σ² ∼ N(0, σ² σ0²),
γj ∼ Bernoulli(qn),   independently for j ∈ {1, …, pn},
α | σ² ∼ N(0, σ² Σ0n),
σ² ∼ IG(α0, β0),   (3)

where Σ0n = {(In − CD⊤D)⁻¹ − In}⁺ for a positive constant C, and D is the difference matrix defined in (2).
We choose the classical Inverse Gamma distribution as the prior for σ² since it is the most commonly used conjugate prior. Other choices of prior could be used, and it can be shown that Theorem 5 applies to a wider family of priors, including common choices like the improper non-informative prior and the class of folded noncentral t priors (see Remark 8). Meanwhile, an independent Bernoulli distribution with probability qn is used as the prior for each γj, so the preliminary marginal inclusion probability for each variable is qn. It is natural to assume that, when the dimension pn diverges with the sample size, qn converges to 0 at some rate. Each βj has a mixture normal distribution. Conditional on γj = 1, βj has a normal distribution with a relatively large variance σ²σ1n². This corresponds to a very wide and flat distribution, usually referred to as a slab prior; we call it the diffusing prior, as named in [21]. Within this variance, σ1n depends on the sample size and diverges at a certain rate as the sample size goes to infinity. Conditional on γj = 0, βj has a normal distribution with a small variance σ²σ0². As the choice of σ0 would not influence the asymptotic results, it can be chosen depending on the sample size or simply as a fixed value.
With a partially linear model, we also need to accommodate the nonparametric part. A conjugate normal prior with positive semi-definite covariance matrix σ²Σ0n is proposed for α = (f(U1), …, f(Un))⊤. The covariance matrix σ²Σ0n is further taken as a function of the difference matrix D so as to eliminate the effect of the nonparametric function; more intuition about the choice of Σ0n will be discussed later. The error term is assumed to be normally distributed. Therefore, conditional on the latent indicator γ, the coefficients β, the nonparametric component α, and the error variance σ², Y has a normal distribution.
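The spike-and-slab part of this prior can be simulated directly. The sketch below draws (γ, β) from the mixture just described; the numerical values of qn, σ1n, and σ0 are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
p_n = 1000
q_n = 10 / p_n        # prior inclusion probability (illustrative)
sigma2 = 1.0          # error variance
sigma1n = 5.0         # large, diverging slab scale (illustrative)
sigma0 = 0.1          # small, fixed spike scale (illustrative)

# gamma_j ~ Bernoulli(q_n); beta_j | gamma_j is a mixture of normals:
# a wide slab N(0, sigma2 * sigma1n^2) when gamma_j = 1, and a narrow
# spike N(0, sigma2 * sigma0^2) when gamma_j = 0.
gamma = rng.binomial(1, q_n, size=p_n)
scale = np.where(gamma == 1, sigma1n, sigma0) * np.sqrt(sigma2)
beta = rng.normal(0.0, scale)
```

Most coordinates fall in the tight spike around zero, while the few with γj = 1 are drawn from the diffuse slab.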
Remark 1. (Comparison with BASAD [21] and SG [22]). As mentioned earlier, the inclusion of γ in the conditional distribution of Y distinguishes our model from BASAD. This difference allows us to sample the active and the inactive groups separately.
In our working model (3), the response variable Y is conditioned on γ and hence depends only on the active covariates Xγ. In BASAD, by contrast, Y depends on both the active and inactive parts of the covariates, so the full conditional distributions of βγ and β_{γᶜ} are not independent. Therefore, to update β in the MCMC, each iteration requires sampling a size-p vector from a multivariate normal distribution, which quickly increases the computational time for large p. On the other hand, the full conditional distributions of βγ and β_{γᶜ} are independent in our proposal, so in each iteration we only need to sample a size-|γ| vector from a multivariate normal distribution and (p − |γ|) scalars from independent univariate normal distributions. The current active model size |γ| is usually small after several iterations if the true model is sparse. In fact, just like SG, the proposed BSM-DP is scalable in high-dimensional problems, meaning that the computation time grows approximately linearly with the dimension p; the computational complexity of each iteration in the estimation procedure is n(p ∨ |γ|² ∨ n²). We have also validated this claim in the simulation study for PLM, with further simulation studies on linear models comparing with BASAD and SG in the supplementary material. It can be shown that SG is equivalent to our Bayesian subset modeling when the variance of the spike prior is taken proportional to 1/n, matching the spike prior variance in SG.
2.2. Estimation procedure
Gibbs sampling is used to update the parameters iteratively. In each iteration, we draw samples from the following full conditional distributions.
- Update γk from a Bernoulli distribution: the full conditional distribution of γk is Bernoulli with probability Pr(γk = 1|β, α, σ², Y, γ−k) = p1/(p1 + p2), where p1 and p2 are proportional to the joint densities under γk = 1 and γk = 0 respectively, Xj denotes the jth column of X, and the active index set after removing the kth covariate is {j ≠ k : γj = 1}.
- Update β from normal distributions: in each iteration, we divide β into the active group and the inactive group based on the current γ. Let A = {j : γj = 1} and Aᶜ = {j : γj = 0}, and write β = (βA⊤, βAᶜ⊤)⊤; the two groups are updated separately. The active group βA is drawn from its multivariate normal full conditional distribution, and the components of the inactive group βAᶜ are drawn independently from univariate normal distributions, which coincide with their spike priors since Y does not depend on the inactive covariates.
- Update σ² from an Inverse Gamma distribution: σ² ~ IG(a, b), where the parameters a and b update the prior parameters α0 and β0 with the current sample size, residuals, and coefficients.
- Update α from a multivariate normal distribution: in the literature, the nonparametric function f(·) is usually assumed to be smooth, meaning f(x) and f(y) should be close whenever x and y are close enough. This dependence among f(U1), f(U2), …, f(Un) suggests that the covariance matrix Σ0n of α has to be a dense matrix. Here we take Σ0n = {(In − CD⊤D)⁻¹ − In}⁺, where C is some positive constant and D is the difference matrix defined in (2); the reason for this specific choice of Σ0n is given in Remark 2. Fig. 1 shows the structure of the difference matrix D, the matrix Σαn used for the update of α, and the prior covariance matrix Σ0n, with constant C = 0.6, sample size n = 200, and difference order m = 20. As demonstrated in the figure, D is a banded upper-triangular matrix with bandwidth m, and the update matrix Σαn is also a band matrix with bandwidth m. The prior covariance matrix of the nonparametric component is dense, with larger positive values near the diagonal that decay gradually to 0 and become negative farther away; the negative off-diagonal values arise because the difference sequence {di}1≤i≤m+1 is standardized to be centered at 0. Theoretically, we require the difference order m, which is also the bandwidth, to go to infinity as n → ∞ at some slow rate. In this way, the effect of the nonparametric component can be removed without over-smoothing the nonparametric function f(·), so the selection consistency for the linear component holds.
Fig. 1.

Visualization to display the magnitude of values in the difference matrix D, the covariance matrix Σαn used for the update of α, and the covariance matrix for the prior Σ0n. All plots take constant C = 0.6, sample size n = 200, and difference order m = 20. In the graph, red indicates positive values at the corresponding locations of the matrix, and purple indicates negative values.
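The matrices visualized in Fig. 1 can be reproduced numerically. The sketch below constructs a difference matrix D from a centered, normalized sequence and forms Σ0n = {(In − CD⊤D)⁻¹ − In}⁺ via the Moore–Penrose pseudo-inverse. The particular difference sequence is an assumption for illustration, and C is chosen adaptively to respect the bound C ≤ min{1, 1/λmax(D⊤D)} stated in Condition A.

```python
import numpy as np

def difference_matrix(n, m, d):
    """(n - m) x n banded difference matrix built from sequence d."""
    D = np.zeros((n - m, n))
    for i in range(n - m):
        D[i, i:i + m + 1] = d
    return D

n, m = 50, 5
# An assumed centered, normalized difference sequence of length m + 1:
# d = (-m, 1, ..., 1) scaled so that sum d_j = 0 and sum d_j^2 = 1.
d = np.ones(m + 1)
d[0] = -m
d /= np.linalg.norm(d)

D = difference_matrix(n, m, d)
I = np.eye(n)

# Choose C within the bound from Condition A: C <= min{1, 1/lambda_max}.
lam_max = np.linalg.eigvalsh(D.T @ D).max()
C = 0.9 * min(1.0, 1.0 / lam_max)

# Prior covariance for alpha: Sigma_0n = {(I - C D^T D)^{-1} - I}^+,
# computed with the Moore-Penrose pseudo-inverse.
M = np.linalg.inv(I - C * D.T @ D) - I
Sigma0 = np.linalg.pinv(M)
```

With C inside the stated bound, In − CD⊤D is positive definite, so M is positive semi-definite and Σ0n is a valid (dense) prior covariance.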
2.3. Selection procedure
In the typical Bayesian variable selection approach, the model with the highest posterior probability is selected as the final model, referred to as the maximum a posteriori (MAP) model. With the spike and slab prior, the posterior over the model space is usually reflected by the posterior distribution of the latent variable γ. An alternative is to consider the marginal probability Pr(γj = 1|Y): one selects the jth covariate if Pr(γj = 1|Y) is equal to or greater than a certain threshold, and a threshold of 0.5 is a natural choice. This is known as the median probability model (MPM), which has been shown to have good predictive power [1]. Although the two approaches may produce different results in practice, it can be shown that they are asymptotically the same under strong selection consistency, as shown in Section 2.4. Moreover, other data-driven criteria could also be used to determine the threshold, e.g., AIC, BIC, and EBIC [5].
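Given post burn-in draws of γ from the Gibbs chain, the two selection rules can be sketched as follows; the helper name `select_models` and the toy chain are ours, purely for illustration.

```python
import numpy as np

def select_models(gamma_samples, threshold=0.5):
    """gamma_samples: (n_iter, p) array of 0/1 draws of gamma."""
    # MPM: keep variable j if its marginal posterior inclusion
    # probability Pr(gamma_j = 1 | Y) meets the threshold.
    incl_prob = gamma_samples.mean(axis=0)
    mpm = np.flatnonzero(incl_prob >= threshold)

    # MAP: the most frequently visited model among the draws.
    models, counts = np.unique(gamma_samples, axis=0, return_counts=True)
    map_model = np.flatnonzero(models[np.argmax(counts)])
    return mpm, map_model

# Toy chain over p = 4 variables: variables 0 and 2 are almost
# always included, variables 1 and 3 almost never.
chain = np.array([[1, 0, 1, 0]] * 8 + [[1, 1, 1, 0]] * 2)
mpm, map_model = select_models(chain)
# mpm -> [0, 2]; map_model -> [0, 2]
```

On this toy chain, both rules agree, mirroring their asymptotic equivalence under strong selection consistency.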
2.4. Theoretical results
Variable selection procedures typically aim to achieve selection consistency; under the Bayesian framework, this means that, conditional on the observed data, the probability of the true model being selected goes to 1 in probability.
That is, the true model is selected consistently. Note that the posterior over the model space is fully specified by γ. If the model is selected via MAP, then selection consistency only requires that the posterior probability of the true model γ₀ be no less than that of any other model; the difference in their posterior probabilities could still shrink to 0. In this paper, we consider the following strong selection consistency:

Pr(γ = γ₀ | Y) → 1  in probability as n → ∞.

It implies that the difference between the posterior probabilities of the true model and any other model tends to 1, a stronger conclusion than selection consistency. We first present the following regularity conditions for the selection consistency of the linear component in the PLM, and we start with the case of known σ², as it provides an intuitive interpretation of the proposed method.
Condition A (On the dimension and priors). The dimension pn satisfies ln(pn) = o(n). The prior probability that a coefficient is nonzero, qn, satisfies qn ~ 1/pn. The variance of the slab prior satisfies σ1n² → ∞ as n → ∞, together with a rate condition involving λ1 for some δ > 0, where λ1 is defined in Condition C. The covariance of the prior for the nonparametric component is Σ0n = {(In − CD⊤D)⁻¹ − In}⁺, where C is a positive constant no greater than min{1, 1/λmax(D⊤D)} and D is the difference matrix defined in (2).
Condition B (Identifiability). There exists K > 1 + 4/δ such that
where X* = C^{1/2}DX, and the projection matrix is defined with respect to X*.
Condition C (Regularity of the design). Define
then , for some 0 < κ < δ, where mn is defined in Condition D.
Condition D (On the true model). Let γ₀ consist of the indices of all nonzero elements of β; that is, γ₀ collects all active predictors. The size of γ₀ satisfies |γ₀| ≺ mn, where mn = cn/ln pn and c < δ/{(4 + δ)(2 + δ)}. Further assume that the corresponding rate condition holds for some 0 < c1 ≤ 1.
Condition E (On the difference matrix). Let D be the difference matrix, as defined in (2). The mth-order difference sequence d1, …, dm+1 satisfies
Furthermore, m → ∞ at a rate governed by constants 0 < c2 < c1 ≤ 1, where c1 is defined in Condition D.
Condition F (On the nonparametric component f(·)). Suppose f(·) ∈ Λk(M) for some k whose range depends on c1 and c2, defined in Conditions D and E. The Lipschitz ball Λk(M) is defined as
where ⌊k⌋ is the largest integer less than k and k′ = k − ⌊k⌋.
The convergence and divergence rates of the parameters in the priors and of the dimension of the variables are stated in Condition A. The identifiability Condition B is needed to distinguish active covariates from spurious ones. Condition C gives the regularity condition on the design matrix: instead of requiring bounded eigenvalues, we only need the minimal eigenvalue to decay more slowly than some rate and the maximal eigenvalue to diverge more slowly than some rate. We would like to point out that, if the size of the true model is not too large, the condition holds even in the extreme case where X is sampled from a normal distribution with compound symmetric covariance matrix and the correlation among predictors ρ → 1; theoretically, the model still works under nearly perfectly correlated covariates. Condition D also states the normality assumption for the error, and we do allow infinitely many active variables.
Conditions E and F control the error in estimating the nonparametric component. Condition E concerns the difference matrix; sequences satisfying these conditions are standard in the difference-based literature. As argued in [23] for the partial smoothing spline method for PLM, a higher-order difference operation gives lower approximation error. We do assume m → ∞, so the approximation error becomes negligible. In the estimation based on partial residuals, the nonparametric estimators of E(Y|U) and E(X|U) must converge sufficiently fast so that their substitution into the OLS estimator does not affect its asymptotic distribution; our upper bound on the growth rate of the difference order m plays a similar role. Finally, the commonly used smoothness assumption for the nonparametric nuisance function is stated in Condition F.
The first step is to derive the posterior probability of an arbitrary model γ.
Lemma 1. Under fixed σ², for any model γ, the posterior probability Pr(γ|Y) has the following explicit form:
where Σ0n is the covariance matrix of the prior for f(U).
Furthermore, define the likelihood ratio between a model γ and the true model γ₀. If Conditions A and C hold, this ratio is bounded by
| (4) |
Remark 2. Lemma 1 gives the explicit form of the posterior probability for any given model γ and puts an upper bound (4) on the likelihood ratio between γ and the true model γ₀. Intuitively from (4), when σ1n² is sufficiently large, the leading factor is driven by a quadratic form in Y involving a matrix Σ1n. So if Σ1n is taken to be CD⊤D, which means Σ0n = {(In − CD⊤D)⁻¹ − In}⁺, then this quadratic form is proportional to the sum of squared residuals under model γ after applying the difference-based method, which can be interpreted as goodness of fit. Additionally, the first term in (4) can be regarded as a penalty on the model size, so the method is largely analogous to an L0-penalized method. As σ1n diverges fast, we can work directly with the residual sums of squares of γ and γ₀. The following two lemmas present some of their properties.
Lemma 2. For any model γ containing the true model, i.e., γ ⊇ γ₀, if Conditions A, E, and F hold, then
Lemma 3. Suppose that Conditions A, C, and D are satisfied. Then for any gn → ∞ and ϵ > 0:
(i) for some c′ > 0, where λ1 is defined in Condition C;
(ii) for some c > 0.
Remark 3. Lemma 2 shows that for overfitted models, after applying the difference operation, the sum of squared residuals has an asymptotic χ² distribution, with degrees of freedom determined by the model size. It also gives the asymptotic distribution of the residual sum of squares under the true model; the difference between the two is further bounded in Lemma 3. Lemma 3 (ii) follows directly from tail bounds for χ² distributions.
Theorem 4. (Strong Selection Consistency under fixed σ²). Suppose that Conditions A, B, C, D, E, and F hold for the partially linear model in (1) with ln pn = o(n). Then

Pr(γ = γ₀ | Y) → 1  in probability,

where the model space is restricted to models of size at most mn = cn/ln pn, as defined in Condition D.
Remark 4. It suffices to show that the total posterior probability of all models other than γ₀ converges to 0.
Recall the bound (4) on the likelihood ratio in Lemma 1. Inspired by [21], we first divide the model space into three disjoint parts, defined below, and in each group we prove that the sum of likelihood ratios converges to 0 in probability.
Consider first the set of overfitted models. A model in this group contains all active variables, so its size might be large. However, since qn ~ 1/pn, the number of extra spurious variables is penalized at a suitable rate.
Next consider large models that include some inactive variables. For such a model γ, let γ′ be the union of γ and the true model γ₀; then γ′ is overfitted. Although the size of γ′ may exceed that of γ, since mn dominates the true model size, the difference is negligible, and model γ′ will also fit better than γ. Thus we can control the likelihood ratio of γ by bounding that of γ′. Since the models in this group are large, the growth of the number of such models is under control.
Now consider the set of underfitted models missing some active variables, where K is the constant defined in Condition B. For any model in this group, at least one active variable is missing. By Condition B on the identifiability of the active variables, the residual sum of squares will be large since the model does not fit well, and thus the likelihood ratio is controlled.
Remark 5. In this paper, we only consider models that are not unreasonably large, that is, models of size at most mn, where mn is of the order n/ln pn. There is a reason behind this choice of mn. It can be shown that the marginal inclusion probability q = Pr(γi = 1) effectively controls the size of the selected model: for any fixed choice of q, one can derive an upper bound on the selected model size. Since we require qn ~ 1/pn in Condition A and the bound on the true model size in Condition D, this upper bound reduces to the order of mn.
Theorem 5. (Strong Selection Consistency under unknown σ²). Suppose that Conditions A, B, C, D, E, and F hold for the partially linear model in (1) with ln pn = o(n). Then

Pr(γ = γ₀ | Y) → 1  in probability,

where the model space is restricted to models of size at most mn = cn/ln pn, as defined in Condition D.
By further integrating out σ2 and applying some inequalities, the problem reduces to intermediate steps in Theorem 4. Please refer to Section 4 for the proof.
Remark 6. As linear models are special cases of partially linear models, the proposed BSM-DP variable selection method is directly applicable to linear models. We have also studied the theoretical properties, finite-sample performance, and computational time of BSM-DP in the setting of linear models, compared with BASAD [21] and SG [22]. To save space, all material related to BSM-DP for linear models is provided in the supplementary material.
3. Numerical study
3.1. Simulation study
3.1.1. Simulation settings and the choice of hyperparameters
In this section, we compare the performance of the proposed method with several existing methods, including penalized methods on partial residuals and methods based on partial correlations of partial residuals. The penalized methods include the famous LASSO [25] and SCAD [8], tuned by BIC; the R packages msgps and ncvreg are used for LASSO and SCAD, respectively. Methods based on partial correlation include the PC-simple algorithm on the partial residuals (PC-PR) [3] and the thresholded partial correlation on partial residuals (TPC-PR) [20]. Both PC-PR and TPC-PR select variables based on the magnitude of the partial correlation between the partial residuals of the response and the corresponding predictors, differing in the threshold used for the partial correlations. TPC-PR uses a threshold depending on the kurtosis of X, so the normality assumption on X is not necessary. For TPC-PR, we also consider fine tuning of the critical value cT(α, n, , m), where c is the tuning parameter chosen by EBIC [5]; this variant is denoted TPC-PR.EBIC.
First we need to specify the hyperparameters σ0, σ1n, qn, m, α0, β0. Partially following the choices in [21], we use α0 = 2, β0 = 5, and σ0 = 0.1 for our proposed method (BSM-DP). The order of the difference operator is set to m = ⌊5n^{1/5}⌋. Additionally, the variance of the diffusing prior σ1n² is set to diverge with the sample size, and we choose qn = Pr(γi = 1) according to a prespecified value of K. The value of K can be a preliminary guess for the size of the active set; for example, one can use the size of the active set selected by LASSO. In this paper we simply set K = 10. In each of the following cases, we run 6000 iterations and treat the first 3000 as burn-in samples. We report simulation results based on both MAP and MPM for our proposed method.
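The hyperparameter choices above can be collected in a small helper. The formula for m and the values α0 = 2, β0 = 5, σ0 = 0.1, K = 10 follow the text; the slab scale σ1n and the calibration of qn from K are illustrative assumptions, since their exact formulas are not reproduced here.

```python
import numpy as np

def bsm_dp_hyperparameters(n, p, K=10):
    """Hyperparameter choices for the simulation in Section 3.1.1.
    sigma1n and the q_n calibration are assumptions, not the paper's
    exact formulas."""
    alpha0, beta0 = 2.0, 5.0              # Inverse Gamma prior IG(alpha0, beta0)
    sigma0 = 0.1                          # spike standard deviation
    m = int(np.floor(5 * n ** (1 / 5)))   # difference order m = floor(5 n^{1/5})
    sigma1n = float(np.sqrt(n))           # assumed diverging slab scale
    q_n = K / p                           # assumed calibration: prior mean model size K
    return {"alpha0": alpha0, "beta0": beta0, "sigma0": sigma0,
            "m": m, "sigma1n": sigma1n, "q_n": q_n}

hp = bsm_dp_hyperparameters(n=200, p=1000)
```

With n = 200 this gives difference order m = 14 and prior inclusion probability qn = 0.01.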
We fix n = 200, p = 1000, and the true active set with fixed nonzero coefficients. The error ϵi is drawn from the standard normal distribution. The fixed design Ui = i/n, i ∈ {1, …, n}, is used with three different types of X:
Case 1. Type I normal distribution with autoregressive covariance matrix: , where Σij = ρ|i−j|.
Case 2. Type II normal distribution with compound symmetric covariance matrix: , where Σij = 1 for i = j and Σij = ρ for i ≠ j.
Case 3. Type III mixture of normals: X is sampled from with probability 0.9 and from with probability 0.1, where Σ is the compound symmetric correlation matrix with correlation ρ.
In each case, we also consider low and high correlations at ρ = 0.2 and ρ = 0.8 separately, with two choices of the nonparametric component f(U) = U2 and f(U) = sin(2πU). The following evaluation criteria are used for comparing methods based on 500 replications:
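All three covariate designs can be generated row by row without forming a p × p covariance matrix. The sketch below is our illustrative Python, not the authors' code; in particular, the variance-inflation factor for the mixture's minority component is an assumption, since the exact mixture components are elided above.

```python
import math
import random

def ar1_row(p, rho, rng):
    # Case 1: autoregressive correlations Corr(X_i, X_j) = rho^{|i-j|},
    # built recursively so no p-by-p Cholesky factorization is needed.
    x = [rng.gauss(0.0, 1.0)]
    s = math.sqrt(1.0 - rho ** 2)
    for _ in range(p - 1):
        x.append(rho * x[-1] + s * rng.gauss(0.0, 1.0))
    return x

def cs_row(p, rho, rng):
    # Case 2: compound symmetry, Corr(X_i, X_j) = rho for i != j,
    # via one shared factor plus independent noise.
    shared = rng.gauss(0.0, 1.0)
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    return [a * shared + b * rng.gauss(0.0, 1.0) for _ in range(p)]

def mixture_row(p, rho, rng, scale=3.0):
    # Case 3: with probability 0.9 draw from the compound-symmetric normal,
    # with probability 0.1 from a heavier-tailed component; the inflation
    # factor `scale` is an assumption, not a value from the paper.
    row = cs_row(p, rho, rng)
    if rng.random() < 0.1:
        row = [scale * v for v in row]
    return row
```

Each call produces one row Xi; stacking n = 200 such rows gives the design matrix for one replication.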
and : the average of the maximal marginal posterior probability over the truly inactive covariates, and the average of the minimal marginal posterior probability over the truly active covariates, respectively.
: the proportion of replications in which the exact model is selected.
: the proportion of replications in which all true active variables are selected.
pi: the proportion of replications in which the ith true active variable is selected, i ∈ {1, 4}.
TP (true positive): the average number of true active variables selected.
FP (false positive): the average number of selected variables that are actually inactive.
ME (model error): .
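Given the set of indices selected in each replication, most of the criteria above reduce to simple set operations. A minimal sketch (the helper name is ours, not from the paper):

```python
def selection_metrics(selected_sets, true_active):
    """Exact-fit rate, coverage rate, average TP and average FP
    computed from per-replication selected index sets."""
    n_rep = len(selected_sets)
    return {
        # proportion of replications selecting exactly the true model
        "exact_fit": sum(s == true_active for s in selected_sets) / n_rep,
        # proportion of replications selecting all true active variables
        "coverage": sum(true_active <= s for s in selected_sets) / n_rep,
        # average number of true active variables selected
        "TP": sum(len(s & true_active) for s in selected_sets) / n_rep,
        # average number of selected variables that are actually inactive
        "FP": sum(len(s - true_active) for s in selected_sets) / n_rep,
    }
```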
Note that the existing methods first compute partial residuals. Obtaining the partial residuals requires E(Y|U), which is estimated by local linear regression, following [20]; the bandwidth is chosen via the plug-in method using the R package KernSmooth. With LASSO and SCAD, we then estimate β and perform variable selection simultaneously in the second step. For the partial-correlation methods PC and TPC, an estimate of the active set is obtained first, and then β is estimated by regressing the partial residuals on the selected predictors through the least squares method; details can be found in [20]. In this simulation, we use the posterior mean of β as the estimate for the proposed method.
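As a concrete illustration of that first step, partial residuals replace the observed values by the observed values minus the fitted conditional mean given U. The paper's comparisons use local linear regression with a plug-in bandwidth (R package KernSmooth); the sketch below substitutes a simple Gaussian-kernel Nadaraya-Watson smoother as a stand-in, so it illustrates the idea rather than the exact estimator.

```python
import math

def kernel_smooth(u_train, y_train, u_eval, h):
    # Nadaraya-Watson estimate of E(Y | U = u0) with a Gaussian kernel;
    # a stand-in for the local linear smoother used in the paper.
    out = []
    for u0 in u_eval:
        w = [math.exp(-0.5 * ((u - u0) / h) ** 2) for u in u_train]
        total = sum(w)
        out.append(sum(wi * yi for wi, yi in zip(w, y_train)) / total)
    return out

def partial_residuals(u, y, h):
    # Partial residuals: observed values minus the fitted conditional mean.
    fitted = kernel_smooth(u, y, u, h)
    return [yi - fi for yi, fi in zip(y, fitted)]
```

The same smoother is applied columnwise to X before the second-step selection.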
3.1.2. Simulation results
Tables 1–3 record the results averaged over the 500 replications. Case 1 uses the autoregressive covariance matrix with decaying correlations. With low correlation (ρ = 0.2), all methods perform well regardless of the type of the nonparametric function. With high correlation (ρ = 0.8), LASSO is prone to overfit the model, selecting the exact model only around 20% of the time.
Table 1.
Summarized simulation results for Case 1: p = 1000, n = 200 and X is sampled from normal distribution with autoregressive correlation matrix. The reported values are means of different performance measures averaged over 500 replications. The methods compared include LASSO and SCAD on partial residuals tuned by BIC (LASSO.BIC, SCAD.BIC), PC-simple algorithm on the partial residuals (PC-PR), threshold partial correlation on partial residuals (TPC-PR, TPC-PR.EBIC), proposed method with model selected by MAP and MPM (BSM-DP.MAP, BSM-DP.MPM). The details of different methods and measures are provided in Section 3.1.1.
| Method | p1 | p4 | TP | FP | ME | |||||
|---|---|---|---|---|---|---|---|---|---|---|
| LASSO.BIC | 0.760 | 1.000 | 1.000 | 1.000 | 4.000 | 0.256 | 0.313 | |||
| SCAD.BIC | 0.932 | 1.000 | 1.000 | 1.000 | 4.000 | 0.082 | 0.026 | |||
| f(u) = u2 | PC-PR | 0.966 | 0.994 | 0.994 | 1.000 | 3.994 | 0.030 | 0.040 | ||
| ρ = 0.2 | TPC-PR | 0.964 | 0.994 | 0.994 | 1.000 | 3.994 | 0.032 | 0.041 | ||
| TPC-PR.EBIC | 0.994 | 1.000 | 1.000 | 1.000 | 4.000 | 0.010 | 0.027 | |||
| BSM-DP.MAP (new) | 0.073 | 1.000 | 0.974 | 1.000 | 1.000 | 1.000 | 4.000 | 0.032 | 0.026 | |
| BSM-DP.MPM (new) | 0.974 | 1.000 | 1.000 | 1.000 | 4.000 | 0.030 | 0.026 | |||
| LASSO.BIC | 0.738 | 1.000 | 1.000 | 1.000 | 4.000 | 0.288 | 0.334 | |||
| SCAD.BIC | 0.924 | 1.000 | 1.000 | 1.000 | 4.000 | 0.120 | 0.029 | |||
| f(u) = sin(2πu) | PC-PR | 0.996 | 0.994 | 0.994 | 1.000 | 3.994 | 0.032 | 0.045 | ||
| ρ = 0.2 | TPC-PR | 0.958 | 0.994 | 0.994 | 1.000 | 3.994 | 0.040 | 0.045 | ||
| TPC-PR.EBIC | 0.982 | 1.000 | 1.000 | 1.000 | 4.000 | 0.018 | 0.029 | |||
| BSM-DP.MAP (new) | 0.074 | 1.000 | 0.974 | 1.000 | 1.000 | 1.000 | 4.000 | 0.028 | 0.026 | |
| BSM-DP.MPM (new) | 0.970 | 1.000 | 1.000 | 1.000 | 4.000 | 0.032 | 0.026 | |||
| LASSO.BIC | 0.222 | 1.000 | 1.000 | 1.000 | 4.000 | 1.266 | 0.309 | |||
| SCAD.BIC | 0.984 | 0.998 | 0.998 | 1.000 | 3.998 | 0.020 | 0.032 | |||
| f(u) = u2 | PC-PR | 0.644 | 0.652 | 0.768 | 1.000 | 3.652 | 0.094 | 0.344 | ||
| ρ = 0.8 | TPC-PR | 0.652 | 0.660 | 0.774 | 1.000 | 3.660 | 0.092 | 0.338 | ||
| TPC-PR.EBIC | 0.846 | 0.858 | 0.886 | 1.000 | 3.858 | 0.038 | 0.149 | |||
| BSM-DP.MAP (new) | 0.053 | 1.000 | 0.990 | 1.000 | 1.000 | 1.000 | 4.000 | 0.010 | 0.023 | |
| BSM-DP.MPM (new) | 0.994 | 1.000 | 1.000 | 1.000 | 4.000 | 0.006 | 0.023 | |||
| LASSO.BIC | 0.172 | 1.000 | 1.000 | 1.000 | 4.000 | 1.362 | 0.321 | |||
| SCAD.BIC | 0.986 | 0.998 | 0.998 | 1.000 | 3.998 | 0.014 | 0.040 | |||
| f(u) = sin(2πu) | PC-PR | 0.650 | 0.656 | 0.796 | 1.000 | 3.998 | 0.014 | 0.348 | ||
| ρ = 0.8 | TPC-PR | 0.662 | 0.668 | 0.804 | 1.000 | 3.666 | 0.132 | 0.337 | ||
| TPC-PR.EBIC | 0.840 | 0.854 | 0.906 | 1.000 | 3.854 | 0.064 | 0.159 | |||
| BSM-DP.MAP (new) | 0.055 | 1.000 | 0.982 | 1.000 | 1.000 | 1.000 | 4.000 | 0.018 | 0.023 | |
| BSM-DP.MPM (new) | 0.984 | 1.000 | 1.000 | 1.000 | 4.000 | 0.016 | 0.023 |
Table 3.
Summarized simulation results for Case 3: p = 1000, n = 200 and X is sampled from mixture of normals with compound symmetric matrix. The reported values are means of different performance measures averaged over 500 replications. The methods compared include LASSO and SCAD on partial residuals tuned by BIC (LASSO.BIC, SCAD.BIC), PC-simple algorithm on the partial residuals (PC-PR), threshold partial correlation on partial residuals (TPC-PR, TPC-PR.EBIC), proposed method with model selected by MAP and MPM (BSM-DP.MAP, BSM-DP.MPM). The details of different methods and measures are provided in Section 3.1.1. Results under high correlation ρ = 0.8 are highlighted.
| Method | p1 | p4 | TP | FP | ME | |||||
|---|---|---|---|---|---|---|---|---|---|---|
| LASSO.BIC | 0.156 | 1.000 | 1.000 | 1.000 | 4.000 | 2.434 | 0.475 | |||
| SCAD.BIC | 0.994 | 1.000 | 1.000 | 1.000 | 4.000 | 0.008 | 0.022 | |||
| f(u) = u2 | PC-PR | 0.476 | 0.690 | 0.724 | 1.000 | 3.676 | 1.304 | 0.927 | ||
| ρ = 0.2 | TPC-PR | 0.574 | 0.574 | 0.636 | 1.000 | 3.552 | 0.622 | 1.252 | ||
| TPC-PR.EBIC | 0.680 | 0.694 | 0.726 | 1.000 | 3.680 | 0.612 | 0.891 | |||
| BSM-DP.MAP (new) | 0.046 | 1.000 | 0.984 | 1.000 | 1.000 | 1.000 | 4.000 | 0.016 | 0.015 | |
| BSM-DP.MPM (new) | 0.986 | 1.000 | 1.000 | 1.000 | 4.000 | 0.014 | 0.015 | |||
| LASSO.BIC | 0.228 | 1.000 | 1.000 | 1.000 | 4.000 | 2.194 | 0.449 | |||
| SCAD.BIC | 0.990 | 1.000 | 1.000 | 1.000 | 4.000 | 0.012 | 0.018 | |||
| f(u) = sin(2πu) | PC-PR | 0.572 | 0.768 | 0.802 | 1.000 | 3.776 | 0.982 | 0.670 | ||
| ρ = 0.2 | TPC-PR | 0.618 | 0.622 | 0.682 | 1.000 | 3.602 | 0.544 | 1.125 | ||
| TPC-PR.EBIC | 0.748 | 0.768 | 0.804 | 1.000 | 3.764 | 0.438 | 0.667 | |||
| BSM-DP.MAP (new) | 0.050 | 1.000 | 0.986 | 1.000 | 1.000 | 1.000 | 4.000 | 0.016 | 0.014 | |
| BSM-DP.MPM (new) | 0.988 | 1.000 | 1.000 | 1.000 | 4.000 | 0.012 | 0.014 | |||
| LASSO.BIC | 0.000 | 0.998 | 0.998 | 1.000 | 3.998 | 14.060 | 0.395 | |||
| SCAD.BIC | 0.380 | 0.382 | 0.432 | 1.000 | 3.374 | 0.120 | 0.652 | |||
| f(u) = u2 | PC-PR | *** | *** | *** | *** | *** | *** | *** | ||
| ρ = 0.8 | TPC-PR | *** | *** | *** | *** | *** | *** | *** | ||
| TPC-PR.EBIC | *** | *** | *** | *** | *** | *** | *** | |||
| BSM-DP.MAP (new) | 0.092 | 1.000 | 0.962 | 1.000 | 1.000 | 1.000 | 4.000 | 0.042 | 0.025 | |
| BSM-DP.MPM (new) | 0.964 | 1.000 | 1.000 | 1.000 | 4.000 | 0.040 | 0.025 | |||
| LASSO.BIC | 0.000 | 0.996 | 0.996 | 1.000 | 3.996 | 14.002 | 0.418 | |||
| SCAD.BIC | 0.422 | 0.426 | 0.492 | 1.000 | 3.416 | 0.022 | 0.672 | |||
| f(u) = sin(2πu) | PC-PR | *** | *** | *** | *** | *** | *** | *** | ||
| ρ = 0.8 | TPC-PR | *** | *** | *** | *** | *** | *** | *** | ||
| TPC-PR.EBIC | *** | *** | *** | *** | *** | *** | *** | |||
| BSM-DP.MAP (new) | 0.082 | 1.000 | 0.960 | 1.000 | 1.000 | 1.000 | 4.000 | 0.042 | 0.020 | |
| BSM-DP.MPM (new) | 0.970 | 1.000 | 1.000 | 1.000 | 4.000 | 0.032 | 0.020 |
*** indicates cases in which a single replication takes more than 48 hours, so 500 replications could not be completed in a timely manner.
Identifying the true covariates is more challenging under dense correlation, i.e., Case 2 with the compound symmetric covariance matrix. The exact-fit rates are much lower for most methods than in Case 1. Notably, under high correlation (ρ = 0.8), most methods other than our proposed BSM-DP perform poorly. LASSO consistently selects a larger model, with about 16 spurious covariates on average, while SCAD tuned by BIC is prone to select a smaller model. PC and TPC are evidently slow under dense correlation: when ρ = 0.2, it takes around 12 hours for PC, TPC and TPC-EBIC to finish one replication, and more than 48 hours are needed when ρ = 0.8, so we mark those entries with stars (***), since 500 replications could not be completed in a timely manner. Our proposed method (BSM-DP) gives the best results, with exact-fit rates above 95% even in the high, dense correlation setting.
In Case 3, X is generated from a mixture of normals, which has heavier tails than the normal distribution. Since PC-PR relies heavily on the normality of the covariates, it performs poorly. The updated version, TPC, which does not assume normality, shows improvement. Our proposed method (BSM-DP) still stands out in the comparison, with exact-fit rates around 95%.
Overall, as the correlation increases, LASSO tuned by BIC tends to overfit the model, while SCAD tuned by BIC is more likely to select a smaller model. When X is normally distributed, PC and TPC perform similarly, but when the normality assumption is violated, TPC performs better than PC. The newly proposed BSM-DP consistently performs best, regardless of the correlation strength and the distribution of X, with exact-fit rates above 90% in all cases.
Remark 7. (On the model selection procedure) In our simulation study, the models selected by MAP and MPM are very similar. MAP requires no threshold; with MPM, the jth variable is selected if its posterior inclusion probability Pr(γj = 1|Y) ≥ 0.5. To investigate the impact of thresholds other than 0.5, we further explore the simulation setting of Case 2 with ρ = 0.8 and consider threshold values ranging from 0 to 1; the results are presented in Fig. 2. With a smaller threshold, more spurious variables are likely to enter the model, so the false discovery rate (dotted red line) is higher, while a larger threshold corresponds to a more stringent selection criterion and a higher chance of missing active variables. Note that in the setting shown in Fig. 2(a), the true positive rate (dashed green line) stays high because all active variables have marginal inclusion probabilities equal to 1; in general, however, we expect a drop as the threshold approaches 1, as in Fig. 2(b) with the lower signal . The exact-fit rate (solid blue line), the proportion of replications in which the exact model is selected, remains reasonably good as long as the threshold is neither too large nor too small. Overall, 0.5 appears to be a good choice.
Fig. 2.

Impact of the threshold value for the posterior inclusion probability on model selection performance. The criteria displayed are the exact-fit rate (solid blue line), false discovery rate (dotted red line) and true positive rate (dashed green line). A range of threshold values from 0 to 1 is used to plot the curve for each criterion. Two signal strengths are considered: and .
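The MPM rule in Remark 7 is straightforward to implement from the post-burn-in Gibbs draws of γ. A minimal sketch (the function names are ours):

```python
def inclusion_probs(gamma_draws):
    # Marginal posterior inclusion probabilities Pr(gamma_j = 1 | Y),
    # estimated as the average of 0/1 indicators over post-burn-in draws.
    n_draws, p = len(gamma_draws), len(gamma_draws[0])
    return [sum(draw[j] for draw in gamma_draws) / n_draws for j in range(p)]

def mpm_select(probs, threshold=0.5):
    # Median probability model: keep variable j iff its posterior
    # inclusion probability is at least the threshold (0.5 by default).
    return {j for j, pj in enumerate(probs) if pj >= threshold}
```

Varying `threshold` over (0, 1) reproduces the trade-off displayed in Fig. 2.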
We also compare the computation time required by the different methods. Among the five methods above, LASSO and SCAD are the fastest, taking about 2 minutes in R per replication. PC and TPC are also fast when X has a covariance matrix with decaying correlations, but they become dramatically slow under high, dense correlation: more than 48 hours are needed for PC in R to finish one replication when the covariance of X is compound symmetric with ρ = 0.8. The situation may worsen with higher-dimensional covariates and larger active sets, since the computational time of both PC and TPC grows polynomially in them. Times were recorded on a MacBook Pro (early 2015) with a 2.7 GHz Intel i5 and 8 GB of RAM.
In contrast, the computational burden of the newly proposed Bayesian method (BSM-DP) is moderate: among all simulation settings, the slowest takes about 12 minutes to finish 6000 iterations for one replication using Julia 0.6. As discussed in Remark 1 of Section 2.1, Bayesian subset modeling is scalable, and the computational time grows only approximately linearly with the dimension of the covariates. Based on the estimation procedure, the computational complexity of each iteration is n(p ∨ |γ|2 ∨ n2), where |γ| is the current active model size. To explore how the computation time changes for different (and especially higher) dimensions of covariates, we record the CPU time to finish 6000 iterations for p ranging from 100 to 4000 in the simulation setting of Case 1 with ρ = 0.8; the result is presented in Fig. 3. The computation time increases nearly linearly with p. It is not perfectly linear because the number of iterations until convergence tends to grow with p: for small p (e.g., p < 500) it usually takes only a few iterations to converge to a small |γ|, but more iterations are often required as p grows. One large jump appears in the plot, which might relate to hardware caching limits, especially when p is large.
Fig. 3.

Change in computational time (in minutes) when the dimension of covariates increases from 100 to 4000. The CPU time is estimated by the median computation time consumed among 10 replications for each dimension setting.
3.2. A real data example — supermarket data analysis
In this section, the proposed method is applied to analyze a supermarket data set studied in [6, 20, 26]. The data set contains n = 464 daily records of the number of customers, which is the response variable, and the sales of p = 6398 products, which are the predictors. Both the response and the predictors are standardized to have zero mean and unit variance. The plot on the left of Fig. 4 shows the number of customers over the days. The periodicity of Y is obvious, so it is reasonable to model Y with a PLM that takes the time variation into account; a covariate Ui = i/n is introduced to represent time. To check the correlation among predictors, we plot the histogram of sampled pairwise correlations in Fig. 4, which shows some moderate correlation. We randomly select 75% of the observations (Xi, Yi, Ui), i ∈ {1, …, 464}, as the training set and keep the remaining 25% as the testing set. The PLM is fitted and variables are selected on the training set, and the mean squared errors of the selected model on the testing set are computed to evaluate the model fit. This procedure is repeated 100 times; Table 4 summarizes the average size of the selected models and the mean squared errors.
Fig. 4.

Visualizations to display features of the (standardized) supermarket data set. The plot on the left gives the trend of daily number of customers entering a supermarket for 464 days. The histogram on the right describes the distribution of the correlation among sampled predictors.
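The repeated random-split evaluation just described can be sketched as follows (illustrative Python; the function name is ours):

```python
import random

def train_test_split_indices(n, train_frac=0.75, seed=None):
    # One random split of {0, ..., n-1} into 75% training / 25% testing
    # rows, as used for the supermarket data; calling this with 100
    # different seeds gives the 100 replications summarized in Table 4.
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(round(train_frac * n))
    return idx[:cut], idx[cut:]
```

For each split, the model is fitted on the training rows and its mean squared error is recorded on the held-out rows; the reported values are averages over the splits.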
Table 4.
Comparisons of the resulting model size and the mean squared errors for different methods. The values in the table are means with the corresponding standard errors (in parentheses) over the 100 replications. The methods compared are LASSO and SCAD on partial residuals tuned by BIC (SIS-LASSO.BIC, SIS-SCAD.BIC), the PC-simple algorithm on the partial residuals (SIS-PC-PR), threshold partial correlation on partial residuals (SIS-TPC-PR, SIS-TPC-PR.EBIC), and the proposed method with the model selected by MPM and tuned by EBIC (SIS-BSM-DP.EBIC).
| Method | Model Size (s.e.) | MSE on training set (s.e.) | MSE on testing set (s.e.) |
|---|---|---|---|
| SIS-LASSO.BIC | 28.80 (3.32) | 0.0575 (0.0030) | 0.0836 (0.0120) |
| SIS-SCAD.BIC | 15.75 (5.44) | 0.0647 (0.0062) | 0.0907 (0.0143) |
| SIS-PC-PR | 12.94 (1.03) | 0.0497 (0.0031) | 0.0847 (0.0112) |
| SIS-TPC-PR | 9.81 (0.92) | 0.0540 (0.0034) | 0.0860 (0.0109) |
| SIS-TPC-PR.EBIC | 8.50 (0.98) | 0.0559 (0.0038) | 0.0867 (0.0117) |
| SIS-BSM-DP.EBIC (new) | 7.95 (1.91) | 0.0610 (0.0055) | 0.0864 (0.0122) |
To reduce the dimension to a moderate scale, we first apply SIS [9] to the partial residuals and keep only the top 2000 predictors as the set subjected to variable selection. We also implement LASSO.BIC, SCAD.BIC, PC-PR, TPC-PR and TPC-PR.EBIC on the same data set for comparison. The hyperparameters for BSM-DP are chosen as in the simulation set-up. We complete 10000 iterations, with the first 6000 as burn-in samples, and rank covariates by their marginal posterior probabilities Pr(γj = 1|Y). The candidate set is further selected by EBIC.
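The SIS screening step ranks predictors by the absolute marginal correlation between each column of X and the partial residuals, keeping the top 2000. A self-contained sketch (our illustration of the idea in [9], not the authors' code):

```python
import math

def _corr(a, b):
    # Pearson correlation of two equal-length sequences.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sa = math.sqrt(sum((v - ma) ** 2 for v in a))
    sb = math.sqrt(sum((v - mb) ** 2 for v in b))
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (sa * sb)

def sis_screen(x_columns, resid, keep=2000):
    # Keep the `keep` predictors most marginally correlated (in absolute
    # value) with the partial residuals of the response.
    ranked = sorted(range(len(x_columns)),
                    key=lambda j: -abs(_corr(x_columns[j], resid)))
    return set(ranked[:keep])
```

Only the surviving predictors enter the subsequent variable-selection step.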
After obtaining partial residuals, LASSO and SCAD allow us to estimate and select variables simultaneously. For PC, TPC, and BSM-DP, an estimate of the active set is obtained first, and then is estimated by regressing the partial residuals on through OLS. Since our theoretical properties are derived for γ, we use BSM-DP only to select the set of active covariates; is likewise obtained by regressing the partial residuals on the selected covariate set.
As shown in Table 4, LASSO selects the largest models, with an average size of 28.80; moreover, its MSE is much smaller on the training set than on the testing set, which suggests that LASSO may be overfitting. SCAD selects smaller models, of size 15.75 on average, but has a larger error on the testing set. PC, TPC, TPC.EBIC and the newly proposed BSM-DP all select smaller models. Among all methods, our proposed BSM-DP selects the smallest number of covariates while attaining a very similar mean squared error on the testing data set. Fig. 5 illustrates a comparison of the estimated nonparametric functions obtained by the different methods.
Fig. 5.

The estimates of the nonparametric function for the supermarket data set by different methods including: LASSO and SCAD on partial residuals tuned by BIC (SIS-LASSO.BIC, SIS-SCAD.BIC), PC-simple algorithm on the partial residuals (SIS-PC-PR), threshold partial correlation on partial residuals (SIS-TPC-PR, SIS-TPC-PR.EBIC), and proposed method with model selected by MPM tuned by EBIC (SIS-BSM-DP.EBIC).
4. Technical proofs
This section includes the technical proofs of Lemmas 1–3 and Theorems 4 and 5.
Proof of Lemma 1. Note that qn → 0, as stated in Condition A. We first write out the posterior distribution of the parameters as
We first integrate out α, and it follows that
Denote . It follows by integrating out β that
| (5) |
where
Let X* = C1/2DX, , . As , for any model , is asymptotically equal to , where is the subset of obtained by taking the last n − m elements. By Conditions C and E, it can be shown that and . We now bound . For any model :
Thus, , so
□
Proof of Lemma 2. By Condition A, if Σ0n is taken to be Σ0n = {(In − CD⊤D)− 1 – In}+, since Σ1n is defined as , so we have
where is the difference matrix defined in (2) and 0 < C ≤ min{1, 1/λmax(D⊤D)} is a constant, thus Σαn = In − Σ1n and are semi-positive definite. By taking difference operation on each side, we have
where , , , . The projection matrix is defined as and furthermore we denote . Under the true model, we have , then for any model containing the true model ,
It suffices to show a.s., a.s. and a.s..
Step 1: To show that a.s.: note that
where . Thus by , , we have
By Cauchy-Schwartz inequality and Condition F, for any f (·) ∈ Λk(M),
By Conditions D, E, F, 1 + (c2 − c1)2(k ∧ 1) < 0, it can be shown that . Then a.s..
Step 2: To show that a.s.: under fixed design, we have
Since DD⊤ → In−m and a.s. from Step 1, so a.s.
Step 3: To show that a.s.: by Condition E, . Let , then ω = C1/2D∈ = C1/2J∈ + C1/2(D − J)∈ = ω1 + ω2. Since D → J, so ω2 is negligible as compared to ω1 as n goes to infinity. Furthermore , therefore a.s..
Overall we have a.s.. Similarly, write . It can be proven that the second and third terms are almost surely 0, so . □
Proof of Lemma 3. We first prove part (i). Note that , , thus
where the first equality is due to the Woodbury matrix identity A−1 − (A + UCV)−1 = A−1U(C−1 + VA−1U)−1 VA−1. Denote , which has rank and . By [13], we can derive the tail bound for the quadratic term:
We next prove part (ii). By Lemma 2, , by the tail bound for χ2 distribution in [17], for any positive x, we have
Furthermore, since m = o(n), thus for any fixed ∈ > 0, there exists a constant c > 0, such that
□
Proof of Theorem 4. We will use a strategy related to that of Theorem 4.1 in [21] to establish Theorem 4.
For overfitted models , we first bound . By Lemma 2, for , we have . For any x > 0, , there exists a constant c > 0 such that
Define events , .
Then for any fixed s > 0, there exists some c, c′ > 0, such that
where the first inequality is due to the fact that . By Lemma 3 and Condition A, it follows that
Then,
So consider the high probability event , we have
As 0 < C < 1, set 0 < s < δ/2, then there exists c > 0 such that
Next consider large models that miss some active variables . Define the events
Similar to the proof for , there exists c′ > 0, such that
Taking s = δ/4, we can find such a w′ > 0 as long as . That is, K > 1 + 4/δ, as stated in Condition B. It follows that
Then consider the high probability event ,
For any model belonging to the group of underfitted models , it follows that
By Condition B,
And on the other hand, . Among them, . By the similar trick in Lemma 2, it can be shown a.s..
For any 0 < w < 1,
The last step follows by the bound for tail with quadratic form. For any 0 < w′ < 1,
where the first inequality is due to the fact that , and the last inequality follows by the exponential tails of . The proof is similar to Lemma 2 (1).
Let c = 2w, it follows that
where the second inequality holds because , and the last step follows by Condition B. Then consider the high probability event ,
By Condition B, , so, we can find 0 < c < 1, w′ > 0, such that . Thus,
□
Proof of Theorem 5. Similar to Theorem 4.2 in [21], we start from (5) and integrate out σ2, thus the posterior probability under unknown σ2 is
Remark 8. Inside the integral, is the dominant term, as the sum of squared residuals has order Op(n). The theorem applies to a wider family of priors as long as π(σ2) is Op(1) and the support is not too irregular. This includes commonly used priors such as the improper non-informative prior π(σ2) ∝ σ−2 and the class of folded noncentral-t priors with fixed hyper-parameters .
By Lemma 1, we have
Define , then , since
First consider overfitted models . Define , tn = − ln {1 − 2(1 + 4s)xn}. Then by Condition D, . Since , then for any s < δ/16, we have
Similar to the proof in Theorem 4 (1), consider the high probability event , where (1 +4s)(1 − ∈*) > (1 + 2s), then
The problem reduces to the same problem in Theorem 4 (1). And for , we can use the same trick, thus we have
For underfitted models , similar to the proof in Theorem 4 (3), consider the high probability event .
If Δn(K) = o(n), by limn→∞(1 + 1/n)n = e,
It reduces to the same problem in Theorem 4 (3).
While if Δn(K) ⪰ n, it follows that
which converges even faster to 0 as n → ∞. □
5. Discussion
Inspired by the difference-based method, we have proposed a new Bayesian approach for selecting variables in the linear component of the partially linear model. We modify the Bayesian shrinking and diffusing priors (BASAD) [21] and propose the new Bayesian subset modeling with diffusing prior (BSM-DP); the idea is extended from linear models to the partially linear model with the help of the difference-based method. Model selection consistency is proved in the setting with ultrahigh dimensional covariates. Compared to BASAD, BSM-DP performs better at identifying low-signal covariates and requires shorter computation time, as shown in the supplementary material. The simulation studies show that, compared with other existing methods for variable selection in PLM, our method tolerates higher correlation among predictors and requires milder conditions on the covariates. The proposed method is also less likely to overfit, as illustrated by the supermarket data example. However, like other Bayesian methods, it comes at some cost: we need specific assumptions on the error distribution, and the computation is relatively intensive compared to frequentist penalized methods. Finally, as with frequentist methods, although we derived the required rates for the hyperparameters of the priors, their practical choice in finite-sample applications still requires fine tuning.
Supplementary Material
Table 2.
Summarized simulation results for Case 2: p = 1000, n = 200 and X is sampled from normal distribution with compound symmetric correlation matrix. The reported values are means of different performance measures averaged over 500 replications. The methods compared include LASSO and SCAD on partial residuals tuned by BIC (LASSO.BIC, SCAD.BIC), PC-simple algorithm on the partial residuals (PC-PR), threshold partial correlation on partial residuals (TPC-PR, TPC-PR.EBIC), proposed method with model selected by MAP and MPM (BSM-DP.MAP, BSM-DP.MPM). The details of different methods and measures are provided in Section 3.1.1. Results under high correlation ρ = 0.8 are highlighted.
| Method | p1 | p4 | TP | FP | ME | |||||
|---|---|---|---|---|---|---|---|---|---|---|
| LASSO.BIC | 0.220 | 1.000 | 1.000 | 1.000 | 4.000 | 1.652 | 0.509 | |||
| SCAD.BIC | 0.938 | 1.000 | 1.000 | 1.000 | 4.000 | 0.070 | 0.028 | |||
| f(u) = u2 | PC-PR | 0.420 | 0.998 | 0.998 | 1.000 | 3.998 | 0.766 | 0.063 | ||
| ρ = 0.2 | TPC-PR | 0.406 | 0.998 | 0.998 | 1.000 | 3.998 | 0.782 | 0.063 | ||
| TPC-PR.EBIC | 0.862 | 0.998 | 0.998 | 1.000 | 3.998 | 0.140 | 0.040 | |||
| BSM-DP.MAP (new) | 0.077 | 1.000 | 0.970 | 1.000 | 1.000 | 1.000 | 4.000 | 0.030 | 0.024 | |
| BSM-DP.MPM (new) | 0.970 | 1.000 | 1.000 | 1.000 | 4.000 | 0.024 | 0.024 | |||
| LASSO.BIC | 0.202 | 1.000 | 1.000 | 1.000 | 4.000 | 1.746 | 0.515 | |||
| SCAD.BIC | 0.950 | 1.000 | 1.000 | 1.000 | 4.000 | 0.054 | 0.027 | |||
| f(u) = sin(2πu) | PC-PR | 0.368 | 0.984 | 0.984 | 1.000 | 3.984 | 0.870 | 0.098 | ||
| ρ = 0.2 | TPC-PR | 0.358 | 0.986 | 0.986 | 1.000 | 3.986 | 0.884 | 0.095 | ||
| TPC-PR.EBIC | 0.826 | 0.990 | 0.990 | 1.000 | 3.990 | 0.208 | 0.060 | |||
| BSM-DP.MAP (new) | 0.087 | 1.000 | 0.970 | 1.000 | 1.000 | 1.000 | 4.000 | 0.036 | 0.032 | |
| BSM-DP.MPM (new) | 0.976 | 1.000 | 1.000 | 1.000 | 4.000 | 0.026 | 0.033 | |||
| LASSO.BIC | 0.000 | 1.000 | 1.000 | 1.000 | 4.000 | 16.174 | 0.390 | |||
| SCAD.BIC | 0.386 | 0.386 | 0.412 | 1.000 | 3.386 | 0.000 | 0.585 | |||
| f(u) = u2 | PC-PR | *** | *** | *** | *** | *** | *** | *** | ||
| ρ = 0.8 | TPC-PR | *** | *** | *** | *** | *** | *** | *** | ||
| TPC-PR.EBIC | *** | *** | *** | *** | *** | *** | *** | |||
| BSM-DP.MAP (new) | 0.105 | 1.000 | 0.942 | 1.000 | 1.000 | 1.000 | 4.000 | 0.068 | 0.034 | |
| BSM-DP.MPM (new) | 0.956 | 1.000 | 1.000 | 1.000 | 4.000 | 0.048 | 0.037 | |||
| LASSO.BIC | 0.000 | 1.000 | 1.000 | 1.000 | 4.000 | 16.114 | 0.401 | |||
| SCAD.BIC | 0.426 | 0.426 | 0.448 | 1.000 | 3.426 | 0.000 | 0.574 | |||
| f(u) = sin(2πu) | PC-PR | *** | *** | *** | *** | *** | *** | *** | ||
| ρ = 0.8 | TPC-PR | *** | *** | *** | *** | *** | *** | *** | ||
| TPC-PR.EBIC | *** | *** | *** | *** | *** | *** | *** | |||
| BSM-DP.MAP (new) | 0.093 | 1.000 | 0.970 | 1.000 | 1.000 | 1.000 | 4.000 | 0.036 | 0.032 | |
| BSM-DP.MPM (new) | 0.976 | 1.000 | 1.000 | 1.000 | 4.000 | 0.026 | 0.033 | | | |
*** indicates cases in which a single replication takes more than 48 hours, so 500 replications could not be completed in a timely manner.
Acknowledgments
The authors are grateful to the Editor-in-Chief, an Associate Editor and the referees for comments and suggestions that led to significant improvements. This research was supported by NSF grants DMS 1820702, DMS 1953196, DMS 2015539 and NIH grants R01CA229542 and R01 ES019672. The content is solely the responsibility of the authors and does not necessarily represent the official views of NSF and NIH.
References
- [1]. Barbieri MM, Berger JO, Optimal predictive model selection, Ann. Statist. 32 (2004) 870–897.
- [2]. Breiman L, Better subset regression using the nonnegative garrote, Technometrics 37 (1995) 373–384.
- [3]. Bühlmann P, Kalisch M, Maathuis MH, Variable selection in high-dimensional linear models: partially faithful distributions and the PC-simple algorithm, Biometrika 97 (2010) 261–278.
- [4]. Chen H, Shiau J-JH, A two-stage spline smoothing method for partially linear models, J. Statist. Plann. Inference 27 (1991) 187–201.
- [5]. Chen J, Chen Z, Extended BIC for small-n-large-P sparse GLM, Statist. Sinica 22 (2012) 555–574.
- [6]. Chen Z, Fan J, Li R, Error variance estimation in ultrahigh-dimensional additive models, J. Amer. Statist. Assoc. 113 (2018) 315–327.
- [7]. Engle RF, Granger CWJ, Rice J, Weiss A, Semiparametric estimates of the relation between weather and electricity sales, J. Amer. Statist. Assoc. 81 (1986) 310–320.
- [8]. Fan J, Li R, Variable selection via nonconcave penalized likelihood and its oracle properties, J. Amer. Statist. Assoc. 96 (2001) 1348–1360.
- [9]. Fan J, Lv J, Sure independence screening for ultrahigh dimensional feature space, J. R. Stat. Soc. Ser. B Stat. Methodol. 70 (2008) 849–911.
- [10]. George EI, Foster DP, Calibration and empirical Bayes variable selection, Biometrika 87 (2000) 731–747.
- [11]. George EI, McCulloch RE, Variable selection via Gibbs sampling, J. Amer. Statist. Assoc. 88 (1993) 881–889.
- [12]. Heckman NE, Spline smoothing in a partly linear model, J. R. Stat. Soc. Ser. B Methodol. 48 (1986) 244–248.
- [13]. Hsu D, Kakade S, Zhang T, A tail inequality for quadratic forms of subgaussian random vectors, Electron. Commun. Probab. 17 (2012) 1–6.
- [14]. Ishwaran H, Rao JS, Spike and slab variable selection: frequentist and Bayesian strategies, Ann. Statist. 33 (2005) 730–773.
- [15]. Ishwaran H, Rao JS, Consistency of spike and slab regression, Statist. Probab. Lett. 81 (2011) 1920–1928.
- [16]. Johnson VE, Rossell D, Bayesian model selection in high-dimensional settings, J. Amer. Statist. Assoc. 107 (2012) 649–660.
- [17]. Laurent B, Massart P, Adaptive estimation of a quadratic functional by model selection, Ann. Statist. 28 (2000) 1302–1338.
- [18]. Liang F, Song Q, Yu K, Bayesian subset modeling for high-dimensional generalized linear models, J. Amer. Statist. Assoc. 108 (2013) 589–606.
- [19]. Liang H, Li R, Variable selection for partially linear models with measurement errors, J. Amer. Statist. Assoc. 104 (2009) 234–248.
- [20]. Liu J, Lou L, Li R, Variable selection for partially linear models via partial correlation, J. Multivariate Anal. 167 (2018) 418–434.
- [21]. Narisetty NN, He X, Bayesian variable selection with shrinking and diffusing priors, Ann. Statist. 42 (2014) 789–817.
- [22]. Narisetty NN, Shen J, He X, Skinny Gibbs: a consistent and scalable Gibbs sampler for model selection, J. Amer. Statist. Assoc. 114 (2019) 1205–1217.
- [23]. Rice J, Convergence rates for partially splined models, Statist. Probab. Lett. 4 (1986) 203–208.
- [24]. Robinson PM, Root-n-consistent semiparametric regression, Econometrica 56 (1988) 931–954.
- [25]. Tibshirani R, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B Methodol. 58 (1996) 267–288.
- [26]. Wang H, Forward regression for ultra-high dimensional variable screening, J. Amer. Statist. Assoc. 104 (2009) 1512–1524.
- [27]. Wang L, Brown LD, Cai TT, A difference based approach to the semiparametric partial linear model, Electron. J. Stat. 5 (2011) 619–641.
- [28]. Xie H, Huang J, SCAD-penalized regression in high-dimensional partially linear models, Ann. Statist. 37 (2009) 673–696.
- [29]. Yatchew A, An elementary estimator of the partial linear model, Econ. Lett. 57 (1997) 135–143.
- [30]. Yuan M, Lin Y, Efficient empirical Bayes variable selection and estimation in linear models, J. Amer. Statist. Assoc. 100 (2005) 1215–1225.
- [31]. Yuan M, Lin Y, On the non-negative garrotte estimator, J. R. Stat. Soc. Ser. B Stat. Methodol. 69 (2007) 143–161.
- [32].Zhang C-H, Nearly unbiased variable selection under minimax concave penalty, Ann. Statist 38 (2010) 894–942. [Google Scholar]
- [33].Zhu L, Li R, Cui H, Robust estimation for partially linear models with large-dimensional covariates, Science China Mathematics 56 (2013) 2069–2088. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [34].Zou H, Hastie T, Regularization and variable selection via the elastic net, J. R. Stat. Soc. Ser. B Stat. Methodol 67 (2005) 301–320 [Google Scholar]