Abstract
Most existing methods of variable selection in partially linear models (PLM) with ultrahigh dimensional covariates are based on partial residuals, which involve a two-step estimation procedure. The estimation error produced in the first step may have an impact on the second step, and multicollinearity among predictors adds further challenges to the model selection procedure. In this paper, we propose a new Bayesian variable selection approach for PLM. This new proposal addresses those two issues simultaneously in that (1) it is a one-step method which selects variables in PLM even when the dimension of covariates increases at an exponential rate with the sample size, and (2) the method retains model selection consistency and outperforms existing ones in the setting of highly correlated predictors. Unlike existing procedures, ours employs the difference-based method to reduce the impact of estimating the nonparametric component, and incorporates Bayesian subset modeling with diffusing prior (BSM-DP) to shrink the corresponding estimator in the linear component. The estimation is implemented by Gibbs sampling, and we prove that the posterior probability of the true model being selected converges to one asymptotically. Simulation studies support the theory and demonstrate the efficiency of our method relative to existing ones, and we conclude with an application to a study of supermarket data.
Keywords: Bayesian variable selection; Difference-based method; Selection consistency; Semiparametric modeling
2010 MSC: Primary 62G08; Secondary 62J05
1. Introduction
Semiparametric models attract considerable attention in the literature since they retain the interpretability of parametric models while keeping some of the flexibility of nonparametric models. In this paper, we study a commonly used semiparametric model, the partially linear model (PLM). The PLM assumes that the response Y depends linearly on some covariates of interest, and nonparametrically on another univariate continuous covariate U defined on [0, 1]. Suppose that the observed data {(Yi, Xi, Ui)}, i ∈ {1, …, n}, is a random sample from the following PLM:
Yi = Xi⊤β + f(Ui) + ϵi,  i ∈ {1, …, n},   (1)

where β is a p-dimensional coefficient vector, f(·) is an unknown smooth function, and ϵi is a random error.
This PLM specifies a parsimonious linear function in the parametric part, while allowing the nonparametric component to remain unconstrained and subject to empirical estimation. In this paper, a new one-step Bayesian approach is proposed to select variables for the PLM with ultrahigh-dimensional covariate X, that is, ln p = o(n). Specifically, our proposed method simplifies the procedure by avoiding estimation of the infinite-dimensional nonparametric component, and yields sparsity in the linear component.
The estimation procedure for the PLM with fixed dimension p of X has been extensively studied. Engle et al. used the penalized least squares method to estimate β and the nuisance function f(·) simultaneously by adding a penalty on the roughness of f(·), which is referred to as partial smoothing splines [4, 7, 12, 23]. Since β is of primary interest, some other methods cleverly avoid the estimation of f(·). For example, Robinson [24] introduced a profile least squares estimator based on the idea of partial residuals, which later became one of the most commonly used approaches to eliminate the nonparametric component in the PLM. Another type of approach to eliminating the nonparametric component is the difference-based method [27, 29], which estimates the coefficients of the linear component by taking differences of the ordered observations. The resulting estimator is proven to be asymptotically efficient under finite dimension. See Section 2.1 for more details about the difference-based method.
Variable selection for the PLM can be accomplished by adding another penalty function on β to the loss function of the aforementioned partial smoothing splines method. The least absolute shrinkage and selection operator (LASSO) [25], the nonnegative garrote [2, 31], the smoothly clipped absolute deviation (SCAD) [8], the elastic net [34], and the minimax concave penalty (MCP) [32] are all popular choices of penalty functions. Xie and Huang [28] used the SCAD penalty to achieve sparsity in the linear part while simultaneously estimating the nonparametric component by polynomial splines; the resulting estimator was shown to be consistent when p grows with n at a suitable rate. An alternative is a two-step procedure, in which Y and X are first regressed on U separately to obtain partial residuals, and variable selection is then applied to the transformed model. For example, consistency of both the linear and the nonparametric components was studied by Zhu et al. [33] in a high-dimensional regime, and Liang and Li [19] discussed this approach in the presence of measurement errors. More recently, Liu et al. [20] proposed a selection procedure that recursively tests the partial correlations among the partial residuals and among the covariates when ln p = o(n); the method is referred to as thresholded partial correlation on partial residuals (TPC-PR). However, to the best of our knowledge, there is little literature on variable selection in the high-dimensional setting based on extensions of the difference-based method.
The Bayesian approach places priors on the parameters and the model space, and selects the model with the highest posterior probability. There have been multiple developments in Bayesian variable selection for linear and generalized linear models. George and McCulloch [11] proposed a milestone method of Bayesian variable selection via stochastic search: they introduced a latent binary vector to indicate the inclusion of variables in linear models, and placed a mixture spike and slab prior on each coefficient conditional on this latent vector. Following this approach, many other selection procedures with similar structure have been proposed, differing mostly in the form of the spike and slab priors or in the prior on the model space. To alleviate the difficulty of choosing specific prior parameters, several approaches have been proposed; see [10, 14, 30]. However, these papers focused on small-scale problems and did not discuss possible extensions to the high-dimensional setting. More recently, Ishwaran and Rao [15] established the oracle property of the posterior mean as n goes to infinity with fixed p, under certain conditions on the prior variances for linear models. Johnson and Rossell [16] proved selection consistency under p = O(n) for a non-local prior in linear model settings. Liang et al. [18] proposed a point-mass spike prior with a slab prior depending on the model size, and proved posterior consistency under ln p = o(n) in generalized linear models, but the corresponding conditions are relatively strong, and the step-wise estimation procedure is not efficient. Narisetty and He [21] also used Gaussian priors but argued that they should be sample-size dependent, referred to as Bayesian shrinking and diffusing priors (BASAD), and obtained strong selection consistency when ln p = o(n) for linear models under mild assumptions.
However, BASAD is not computationally practical for large-p problems, since it requires updating β from a p-dimensional multivariate normal distribution in each iteration. Recently, Narisetty et al. [22] proposed the Skinny Gibbs (SG) algorithm to address this computational issue by sparsifying the precision matrix, and argued that SG is scalable: the required computation time grows approximately linearly in p. Selection consistency was proved for logistic regression. While spike and slab priors have been widely used in applications for their attractive interpretability, the theory for spike and slab models has not caught up with the applications. Again, all the aforementioned papers focus on linear or generalized linear models, and the corresponding work on semiparametric or nonparametric models in the high-dimensional setting is limited.
In this paper, we propose a Bayesian subset selection procedure for the partially linear model. We incorporate the difference-based method in the prior for the nonparametric component. For the parametric component, we adopt a modified version of the Bayesian shrinking and diffusing priors (BASAD) [21] and propose the novel Bayesian subset modeling with diffusing prior (BSM-DP). We use a normal distribution with a diverging variance as the slab prior and a normal distribution with a small variance as the spike prior. Unlike BASAD, the response variable in our model depends only on the active covariates, which conveniently allows us to sample coefficients separately for the active and the inactive sets during estimation. In fact, the spike prior has no impact on the theoretical result, so any proposal including a point mass will work. As a practical note, we recommend a Gaussian distribution with a small variance, which allows the Markov chain more flexibility to explore the model space and hence avoids local traps. As a result, the proposed method is more computationally efficient than BASAD. We also note that Skinny Gibbs (SG) [22] is a special case of BSM-DP in which the variance of the spike prior is set proportional to 1/n; the original paper [22] discussed logistic regression only. We establish selection consistency for the parametric component in partially linear models when ln p = o(n) under mild conditions.
The rest of the paper is organized as follows. In Section 2 we present the Bayesian subset modeling with diffusing prior (BSM-DP) and discuss variable selection for the partially linear model, followed by the estimation procedure, regularity conditions, and theoretical results. Numerical studies are presented in Section 3 to demonstrate the reliability of the proposed model, and we further apply the proposed method to the supermarket data set. Proofs of lemmas and theorems are given in Section 4, followed by a discussion in Section 5.
2. Bayesian subset modeling with diffusing prior
2.1. Model and notation
Suppose that {(Yi, Xi, Ui)}, i ∈ {1, …, n}, is a random sample from PLM (1) with high-dimensional covariate Xi ∈ ℝ^{pn} and univariate covariate Ui ∈ [0, 1], where we use pn to emphasize that the number of variables is allowed to diverge with the sample size n. Assume that the random error ϵ is independent of (X⊤, U), and that each observation Xi has the same distribution with mean 0 and covariance Σ. Denote f(Ui) by αi, and let α = (α1, …, αn)⊤ be a vector of size n. Then Y is the corresponding vector of n responses, and X is the design matrix of size n × pn.
We will propose a prior for the nonparametric function (i.e., the vector α) in our proposed Bayesian subset selection based on the difference-based method. Assume the observations {(Yi, Xi, Ui)}1≤i≤n are ordered by increasing values of {Ui}1≤i≤n. The difference between adjacent observed responses can be written as

Yi − Yi−1 = (Xi − Xi−1)⊤β + {f(Ui) − f(Ui−1)} + (ϵi − ϵi−1).
If Ui−1 and Ui are close and f(·) is smooth enough, f(Ui) should also be close to f(Ui−1), so the nonparametric part tends to cancel out. In this case, the ordinary least squares estimate can be applied to the differenced data, as long as X is not perfectly correlated with U. Define the mth-order difference sequence {di}i∈{1,…,m+1} to satisfy Σ_{i=1}^{m+1} di = 0 and Σ_{i=1}^{m+1} di² = 1. The mth-order difference operation then reduces the sample size to n − m by defining

Ỹi = C^{1/2} Σ_{j=1}^{m+1} dj Y_{i+j−1},

for i ∈ {1, …, n − m}, where C is some positive constant. Define the difference matrix D as
D =
⎛ d1  d2  ⋯  dm+1  0    ⋯   0   ⎞
⎜ 0   d1  d2  ⋯   dm+1  ⋯   0   ⎟
⎜ ⋮        ⋱   ⋱         ⋱   ⋮   ⎟
⎝ 0   ⋯   0   d1   d2   ⋯  dm+1 ⎠,   (2)

an (n − m) × n banded matrix whose ith row contains the difference sequence in columns i through i + m.
Therefore the PLM (1) can be rewritten in matrix form as

Ỹ = X̃β + f̃ + ϵ̃,

where Ỹ = C^{1/2}DY, X̃ = C^{1/2}DX, f̃ = C^{1/2}Dα, and ϵ̃ = C^{1/2}Dϵ. Under some smoothness conditions on f(·) with fixed p, Yatchew [29] and Wang et al. [27] showed that the ordinary least squares estimator is asymptotically efficient as m → ∞, if X and U are independent. This indicates that the effect of the nonparametric component is negligible after applying a high-order difference operation to the data.
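As an illustration of the differencing idea (a minimal numpy sketch under assumed settings, not the authors' code), the following applies a first-order (m = 1) difference with a centered, normalized sequence and runs ordinary least squares on the differenced data; the smooth term f(U) nearly cancels, so β is recovered without estimating f.

```python
import numpy as np

def difference_matrix(n, m, d):
    """(n - m) x n banded difference matrix: row i holds the
    difference sequence d (length m + 1) in columns i..i+m."""
    D = np.zeros((n - m, n))
    for i in range(n - m):
        D[i, i:i + m + 1] = d
    return D

rng = np.random.default_rng(0)
n, p, m = 200, 5, 1
# First-order difference sequence with sum d_j = 0 and sum d_j^2 = 1.
d = np.array([1.0, -1.0]) / np.sqrt(2.0)

U = np.sort(rng.uniform(size=n))
X = rng.normal(size=(n, p))
beta = np.array([2.0, 0.0, -1.5, 0.0, 1.0])   # illustrative coefficients
f = np.sin(2 * np.pi * U)                     # smooth nuisance function
Y = X @ beta + f + 0.5 * rng.normal(size=n)

D = difference_matrix(n, m, d)
# After differencing, f(U_i) - f(U_{i-1}) is nearly zero, so OLS on
# the differenced data estimates beta without estimating f.
beta_hat, *_ = np.linalg.lstsq(D @ X, D @ Y, rcond=None)
```

The recovered `beta_hat` is close to the true `beta` even though f(·) was never estimated.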
In the literature, Ui’s are either from a fixed design, e.g., Ui = i/n, or observations from a distribution on [0, 1] with density function bounded away from 0. In this paper, we only consider the case where X and U are independent under a dense design with bounded max_{2≤i≤n} |Ui − Ui−1|.
We use Xj as the notation for the jth covariate. A latent binary random vector γ of size pn is introduced, whose jth entry γj indicates whether Xj is included in the model (1 = present, 0 = not present). Therefore, the model space is fully specified by γ, and we use γ to refer to a model interchangeably. The true model is denoted by γ₀. The cardinality of a model γ, denoted by |γ|, is the size of the model. Consequently, β_γ is the subvector of β of size |γ|, X_γ is the submatrix of X with respect to model γ, and Σ_γ is the covariance matrix for X_γ. Other notations used in the paper are unified as follows.
Model operations: γ ∩ γ′ and γ ∪ γ′ are defined as the intersection and union of models γ and γ′; for example, X_{γ∪γ′} is the submatrix of X corresponding to model γ ∪ γ′.
Rates: an ⪯ bn or bn ⪰ an means an = O(bn); an ≺ bn or bn ≻ an means an = o(bn); and an ~ bn means an/bn → c for some positive constant c.
Matrices and matrix operations: the n × n identity matrix is denoted by In. For a matrix M, ∥M∥ is the spectral norm, i.e., the largest singular value of M. The Moore–Penrose inverse of M is denoted by M⁺; it is the unique generalized inverse. If M is a positive definite matrix, we use λmin(M) and λmax(M) to denote its minimum and maximum eigenvalues.
Here and hereafter, the densities are conditional on X and U. The working model for variable selection in the partially linear model (1) via Bayesian subset modeling with diffusing prior (BSM-DP) is proposed as

Y | γ, β, α, σ² ∼ N(X_γ β_γ + α, σ² In),
βj | γj = 1, σ² ∼ N(0, σ² σ1n²),   βj | γj = 0, σ² ∼ N(0, σ² σ0²),
γj ∼ Bernoulli(qn),   independently for j ∈ {1, …, pn},
α | σ² ∼ N(0, σ² Σ0n),
σ² ∼ IG(α0, β0),   (3)

where Σ0n = {(In − CD⊤D)⁻¹ − In}⁺ for a positive constant C, and D is the difference matrix defined in (2).
We choose the classical Inverse Gamma distribution as the prior for σ² since it is the most commonly used conjugate prior. Other choices of prior could be used, and it can be shown that Theorem 5 applies to a wider family of priors, including common choices like the improper non-informative prior and the class of folded noncentral t priors (see Remark 8). Meanwhile, an independent Bernoulli distribution with probability qn is used as the prior for each γj, so the preliminary marginal inclusion probability for each variable is qn. It is natural to assume that, when the dimension pn diverges with the sample size, qn converges to 0 at some rate. Each βj has a mixture normal distribution. Conditional on γj = 1, βj has a normal distribution with a relatively large variance σ²σ1n². This corresponds to a very wide and flat distribution, usually referred to as a slab prior; we call it the diffusing prior, as named in [21]. Within this variance, σ1n depends on the sample size and diverges at a certain rate as the sample size goes to infinity. Conditional on γj = 0, βj has a normal distribution with a small variance σ²σ0². As the choice of σ0 would not influence the asymptotic results, it can be chosen depending on the sample size or simply as a fixed value.
With a partially linear model, we also need to accommodate the nonparametric part. A conjugate normal prior with positive semi-definite covariance matrix σ²Σ0n is proposed for α = (f(U1), …, f(Un))⊤. The covariance matrix σ²Σ0n is further taken as a function of the difference matrix D so as to eliminate the effect of the nonparametric function; more intuition about the choice of Σ0n will be discussed later. The error term is assumed to be normally distributed. Therefore, conditional on the latent indicator γ, the coefficients β, the nonparametric component α, and the error variance σ², Y has a normal distribution.
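The spike-and-slab part of this prior can be simulated directly. The sketch below draws (γ, β) from the mixture just described; the numerical values of qn, σ1n, and σ0 are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
p_n = 1000
q_n = 10 / p_n        # prior inclusion probability (illustrative)
sigma2 = 1.0          # error variance
sigma1n = 5.0         # large, diverging slab scale (illustrative)
sigma0 = 0.1          # small, fixed spike scale (illustrative)

# gamma_j ~ Bernoulli(q_n); beta_j | gamma_j is a mixture of normals:
# a wide slab N(0, sigma2 * sigma1n^2) when gamma_j = 1, and a narrow
# spike N(0, sigma2 * sigma0^2) when gamma_j = 0.
gamma = rng.binomial(1, q_n, size=p_n)
scale = np.where(gamma == 1, sigma1n, sigma0) * np.sqrt(sigma2)
beta = rng.normal(0.0, scale)
```

Most coordinates fall in the tight spike around zero, while the few with γj = 1 are drawn from the diffuse slab.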
Remark 1. (Comparison with BASAD [21] and SG [22]). As mentioned earlier, the inclusion of γ in the conditional distribution of Y distinguishes our model from BASAD. This difference allows us to sample the active and the inactive groups separately.
In our working model (3), the response variable Y is conditioned on γ and hence depends only on the active covariates Xγ. In BASAD, by contrast, Y depends on both the active and inactive parts of the covariates, so the full conditional distributions of βγ and β_{γᶜ} are not independent. Therefore, to update β in the MCMC, each iteration requires sampling a size-p vector from a multivariate normal distribution, which quickly increases the computational time for large p. On the other hand, the full conditional distributions of βγ and β_{γᶜ} are independent in our proposal, so in each iteration we only need to sample a size-|γ| vector from a multivariate normal distribution and (p − |γ|) scalars from independent univariate normal distributions. The current active model size |γ| is usually small after several iterations if the true model is sparse. In fact, just like SG, the proposed BSM-DP is scalable in high-dimensional problems, meaning that the computation time grows approximately linearly with the dimension p; the computational complexity of each iteration in the estimation procedure is n(p ∨ |γ|² ∨ n²). We have also validated this claim in the simulation study for PLM, with further simulation studies on linear models comparing with BASAD and SG in the supplementary material. It can be shown that SG is equivalent to our Bayesian subset modeling when the variance of the spike prior is taken proportional to 1/n, matching the spike prior variance in SG.
2.2. Estimation procedure
Gibbs sampling is used to update the parameters iteratively. In each iteration, we draw samples from the following full conditional distributions.
- Update γk from a Bernoulli distribution: the full conditional distribution of γk is Bernoulli with probability Pr(γk = 1|β, α, σ², Y, γ−k) = p1/(p1 + p2), where p1 and p2 are proportional to the joint densities under γk = 1 and γk = 0 respectively, Xj denotes the jth column of X, and the active index set after removing the kth covariate is {j ≠ k : γj = 1}.
- Update β from normal distributions: in each iteration, we divide β into the active group and the inactive group based on the current γ. Let A = {j : γj = 1} and Aᶜ = {j : γj = 0}, and write β = (βA⊤, βAᶜ⊤)⊤; the two groups are updated separately. The active group βA is drawn from its multivariate normal full conditional distribution, and the components of the inactive group βAᶜ are drawn independently from univariate normal distributions, which coincide with their spike priors since Y does not depend on the inactive covariates.
- Update σ² from an Inverse Gamma distribution: σ² ~ IG(a, b), where the parameters a and b update the prior parameters α0 and β0 with the current sample size, residuals, and coefficients.
- Update α from a multivariate normal distribution: in the literature, the nonparametric function f(·) is usually assumed to be smooth, meaning f(x) and f(y) should be close whenever x and y are close enough. This dependence among f(U1), f(U2), …, f(Un) suggests that the covariance matrix Σ0n of α has to be a dense matrix. Here we take Σ0n = {(In − CD⊤D)⁻¹ − In}⁺, where C is some positive constant and D is the difference matrix defined in (2); the reason for this specific choice of Σ0n is given in Remark 2. Fig. 1 shows the structure of the difference matrix D, the matrix Σαn used for the update of α, and the prior covariance matrix Σ0n, with constant C = 0.6, sample size n = 200, and difference order m = 20. As demonstrated in the figure, D is a banded upper-triangular matrix with bandwidth m, and the update matrix Σαn is also a band matrix with bandwidth m. The prior covariance matrix of the nonparametric component is dense, with larger positive values near the diagonal that decay gradually to 0 and become negative farther away; the negative off-diagonal values arise because the difference sequence {di}1≤i≤m+1 is standardized to be centered at 0. Theoretically, we require the difference order m, which is also the bandwidth, to go to infinity as n → ∞ at some slow rate. In this way, the effect of the nonparametric component can be removed without over-smoothing the nonparametric function f(·), so the selection consistency for the linear component holds.
Fig. 1.

Visualization to display the magnitude of values in the difference matrix D, the covariance matrix Σαn used for the update of α, and the covariance matrix for the prior Σ0n. All plots take constant C = 0.6, sample size n = 200, and difference order m = 20. In the graph, red indicates positive values at the corresponding locations of the matrix, and purple indicates negative values.
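The matrices visualized in Fig. 1 can be reproduced numerically. The sketch below constructs a difference matrix D from a centered, normalized sequence and forms Σ0n = {(In − CD⊤D)⁻¹ − In}⁺ via the Moore–Penrose pseudo-inverse. The particular difference sequence is an assumption for illustration, and C is chosen adaptively to respect the bound C ≤ min{1, 1/λmax(D⊤D)} stated in Condition A.

```python
import numpy as np

def difference_matrix(n, m, d):
    """(n - m) x n banded difference matrix built from sequence d."""
    D = np.zeros((n - m, n))
    for i in range(n - m):
        D[i, i:i + m + 1] = d
    return D

n, m = 50, 5
# An assumed centered, normalized difference sequence of length m + 1:
# d = (-m, 1, ..., 1) scaled so that sum d_j = 0 and sum d_j^2 = 1.
d = np.ones(m + 1)
d[0] = -m
d /= np.linalg.norm(d)

D = difference_matrix(n, m, d)
I = np.eye(n)

# Choose C within the bound from Condition A: C <= min{1, 1/lambda_max}.
lam_max = np.linalg.eigvalsh(D.T @ D).max()
C = 0.9 * min(1.0, 1.0 / lam_max)

# Prior covariance for alpha: Sigma_0n = {(I - C D^T D)^{-1} - I}^+,
# computed with the Moore-Penrose pseudo-inverse.
M = np.linalg.inv(I - C * D.T @ D) - I
Sigma0 = np.linalg.pinv(M)
```

With C inside the stated bound, In − CD⊤D is positive definite, so M is positive semi-definite and Σ0n is a valid (dense) prior covariance.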
2.3. Selection procedure
In the typical Bayesian variable selection approach, the model with the highest posterior probability is selected as the final model, referred to as the maximum a posteriori (MAP) model. With the spike and slab prior, the posterior over the model space is usually reflected by the posterior distribution of the latent variable γ. An alternative is to consider the marginal probability Pr(γj = 1|Y): one selects the jth covariate if Pr(γj = 1|Y) is equal to or greater than a certain threshold, and a threshold of 0.5 is a natural choice. This is known as the median probability model (MPM), which has been shown to have good predictive power [1]. Although the two approaches may produce different results in practice, it can be shown that they are asymptotically the same under strong selection consistency, as shown in Section 2.4. Moreover, other data-driven criteria could also be used to determine the threshold, e.g., AIC, BIC, and EBIC [5].
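Given post burn-in draws of γ from the Gibbs chain, the two selection rules can be sketched as follows; the helper name `select_models` and the toy chain are ours, purely for illustration.

```python
import numpy as np

def select_models(gamma_samples, threshold=0.5):
    """gamma_samples: (n_iter, p) array of 0/1 draws of gamma."""
    # MPM: keep variable j if its marginal posterior inclusion
    # probability Pr(gamma_j = 1 | Y) meets the threshold.
    incl_prob = gamma_samples.mean(axis=0)
    mpm = np.flatnonzero(incl_prob >= threshold)

    # MAP: the most frequently visited model among the draws.
    models, counts = np.unique(gamma_samples, axis=0, return_counts=True)
    map_model = np.flatnonzero(models[np.argmax(counts)])
    return mpm, map_model

# Toy chain over p = 4 variables: variables 0 and 2 are almost
# always included, variables 1 and 3 almost never.
chain = np.array([[1, 0, 1, 0]] * 8 + [[1, 1, 1, 0]] * 2)
mpm, map_model = select_models(chain)
# mpm -> [0, 2]; map_model -> [0, 2]
```

On this toy chain, both rules agree, mirroring their asymptotic equivalence under strong selection consistency.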
2.4. Theoretical results
Variable selection procedures typically aim to achieve selection consistency; under the Bayesian framework, this means that, conditional on the observed data, the probability of the true model being selected goes to 1 in probability.
That is, the true model is selected consistently. Note that the posterior over the model space is fully specified by γ. If the model is selected via MAP, then selection consistency only requires that the posterior probability of the true model γ₀ be no less than that of any other model; the difference in their posterior probabilities could still shrink to 0. In this paper, we consider the following strong selection consistency:

Pr(γ = γ₀ | Y) → 1  in probability as n → ∞.

It implies that the difference between the posterior probabilities of the true model and any other model tends to 1, a stronger conclusion than selection consistency. We first present the following regularity conditions for the selection consistency of the linear component in the PLM, and we start with the case of known σ², as it provides an intuitive interpretation of the proposed method.
Condition A (On the dimension and priors). The dimension pn satisfies ln(pn) = o(n). The prior probability that a coefficient is nonzero, qn, satisfies qn ~ 1/pn. The variance of the slab prior satisfies σ1n² → ∞ as n → ∞, together with a rate condition involving λ1 for some δ > 0, where λ1 is defined in Condition C. The covariance of the prior for the nonparametric component is Σ0n = {(In − CD⊤D)⁻¹ − In}⁺, where C is a positive constant no greater than min{1, 1/λmax(D⊤D)} and D is the difference matrix defined in (2).
Condition B (Identifiability). There exists K > 1 + 4/δ such that
where X* = C^{1/2}DX, and the projection matrix is defined with respect to X*.
Condition C (Regularity of the design). Define
then , for some 0 < κ < δ, where mn is defined in Condition D.
Condition D (On the true model). Let γ₀ consist of the indices of all nonzero elements of β; that is, γ₀ collects all active predictors. The size of γ₀ satisfies |γ₀| ≺ mn, where mn = cn/ln pn and c < δ/{(4 + δ)(2 + δ)}. Further assume that the corresponding rate condition holds for some 0 < c1 ≤ 1.
Condition E (On the difference matrix). Let D be the difference matrix, as defined in (2). The mth-order difference sequence d1, …, dm+1 satisfies
Furthermore, m → ∞ at a rate governed by constants 0 < c2 < c1 ≤ 1, where c1 is defined in Condition D.
Condition F (On the nonparametric component f(·)). Suppose f(·) ∈ Λk(M) for some k whose range depends on c1 and c2, defined in Conditions D and E. The Lipschitz ball Λk(M) is defined as
where ⌊k⌋ is the largest integer less than k and k′ = k − ⌊k⌋.
The convergence and divergence rates of the parameters in the priors and of the dimension of the variables are stated in Condition A. The identifiability Condition B is needed to distinguish active covariates from spurious ones. Condition C gives the regularity condition on the design matrix: instead of requiring bounded eigenvalues, we only need the minimal eigenvalue to decay more slowly than some rate and the maximal eigenvalue to diverge more slowly than some rate. We would like to point out that, if the size of the true model is not too large, the condition holds even in the extreme case where X is sampled from a normal distribution with compound symmetric covariance matrix and the correlation among predictors ρ → 1; theoretically, the model still works under nearly perfectly correlated covariates. Condition D also states the normality assumption for the error, and we do allow infinitely many active variables.
Conditions E and F control the error in estimating the nonparametric component. Condition E concerns the difference matrix; sequences satisfying these conditions are standard in the difference-based literature. As argued in [23] for the partial smoothing spline method for PLM, a higher-order difference operation gives lower approximation error. We do assume m → ∞, so the approximation error becomes negligible. In the estimation based on partial residuals, the nonparametric estimators of E(Y|U) and E(X|U) must converge sufficiently fast so that their substitution into the OLS estimator does not affect its asymptotic distribution; our upper bound on the growth rate of the difference order m plays a similar role. Finally, the commonly used smoothness assumption for the nonparametric nuisance function is stated in Condition F.
The first step is to derive the posterior probability of an arbitrary model γ.
Lemma 1. Under fixed σ², for any model γ, the posterior probability Pr(γ|Y) has the following explicit form:
where Σ0n is the covariance matrix of the prior for f(U).
Furthermore, define the likelihood ratio between a model γ and the true model γ₀. If Conditions A and C hold, this ratio is bounded by
| (4) |
Remark 2. Lemma 1 gives the explicit form of the posterior probability for any given model γ and puts an upper bound (4) on the likelihood ratio between γ and the true model γ₀. Intuitively from (4), when σ1n² is sufficiently large, the leading factor is driven by a quadratic form in Y involving a matrix Σ1n. So if Σ1n is taken to be CD⊤D, which means Σ0n = {(In − CD⊤D)⁻¹ − In}⁺, then this quadratic form is proportional to the sum of squared residuals under model γ after applying the difference-based method, which can be interpreted as goodness of fit. Additionally, the first term in (4) can be regarded as a penalty on the model size, so the method is largely analogous to an L0-penalized method. As σ1n diverges fast, we can work directly with the residual sums of squares of γ and γ₀. The following two lemmas present some of their properties.
Lemma 2. For any model γ containing the true model, i.e., γ ⊇ γ₀, if Conditions A, E, and F hold, then
Lemma 3. Suppose that Conditions A, C, and D are satisfied. Then for any gn → ∞ and ϵ > 0:
(i) for some c′ > 0, where λ1 is defined in Condition C;
(ii) for some c > 0.
Remark 3. Lemma 2 shows that for overfitted models, after applying the difference operation, the sum of squared residuals has an asymptotic χ² distribution, with degrees of freedom determined by the model size. It also gives the asymptotic distribution of the residual sum of squares under the true model; the difference between the two is further bounded in Lemma 3. Lemma 3 (ii) follows directly from tail bounds for χ² distributions.
Theorem 4. (Strong Selection Consistency under fixed σ²). Suppose that Conditions A, B, C, D, E, and F hold for the partially linear model in (1) with ln pn = o(n). Then

Pr(γ = γ₀ | Y) → 1  in probability,

where the model space is restricted to models of size at most mn = cn/ln pn, as defined in Condition D.
Remark 4. It suffices to show that the total posterior probability of all models other than γ₀ converges to 0.
Recall the bound (4) on the likelihood ratio in Lemma 1. Inspired by [21], we first divide the model space into three disjoint parts, defined below, and in each group we prove that the sum of likelihood ratios converges to 0 in probability.
Consider first the set of overfitted models. A model in this group contains all active variables, so its size might be large. However, since qn ~ 1/pn, the number of extra spurious variables is penalized at a suitable rate.
Next consider large models that include some inactive variables. For such a model γ, let γ′ be the union of γ and the true model γ₀; then γ′ is overfitted. Although the size of γ′ may exceed that of γ, since mn dominates the true model size, the difference is negligible, and model γ′ will also fit better than γ. Thus we can control the likelihood ratio of γ by bounding that of γ′. Since the models in this group are large, the growth of the number of such models is under control.
Now consider the set of underfitted models missing some active variables, where K is the constant defined in Condition B. For any model in this group, at least one active variable is missing. By Condition B on the identifiability of the active variables, the residual sum of squares will be large since the model does not fit well, and thus the likelihood ratio is controlled.
Remark 5. In this paper, we only consider models that are not unreasonably large, that is, models of size at most mn, where mn is of the order n/ln pn. There is a reason behind this choice of mn. It can be shown that the marginal inclusion probability q = Pr(γi = 1) effectively controls the size of the selected model: for any fixed choice of q, one can derive an upper bound on the selected model size. Since we require qn ~ 1/pn in Condition A and the bound on the true model size in Condition D, this upper bound reduces to the order of mn.
Theorem 5. (Strong Selection Consistency under unknown σ²). Suppose that Conditions A, B, C, D, E, and F hold for the partially linear model in (1) with ln pn = o(n). Then

Pr(γ = γ₀ | Y) → 1  in probability,

where the model space is restricted to models of size at most mn = cn/ln pn, as defined in Condition D.
By further integrating out σ2 and applying some inequalities, the problem reduces to intermediate steps in Theorem 4. Please refer to Section 4 for the proof.
Remark 6. As linear models are special cases of partially linear models, the proposed BSM-DP variable selection method is directly applicable to linear models. We have also studied the theoretical properties, finite-sample performance, and computational time of BSM-DP in the setting of linear models, compared with BASAD [21] and SG [22]. To save space, all material related to BSM-DP for linear models is provided in the supplementary material.
3. Numerical study
3.1. Simulation study
3.1.1. Simulation settings and the choice of hyperparameters
In this section, we compare the performance of the proposed method with several existing methods, including penalized methods on partial residuals and methods based on partial correlations of partial residuals. The penalized methods include the famous LASSO [25] and SCAD [8], tuned by BIC; the R packages msgps and ncvreg are used for LASSO and SCAD, respectively. Methods based on partial correlation include the PC-simple algorithm on the partial residuals (PC-PR) [3] and the thresholded partial correlation on partial residuals (TPC-PR) [20]. Both PC-PR and TPC-PR select variables based on the magnitude of the partial correlation between the partial residuals of the response and the corresponding predictors, differing in the threshold used for the partial correlations. TPC-PR uses a threshold depending on the kurtosis of X, so the normality assumption on X is not necessary. For TPC-PR, we also consider fine tuning of the critical value cT(α, n, , m), where c is the tuning parameter chosen by EBIC [5]; this variant is denoted TPC-PR.EBIC.
First we need to specify the hyperparameters σ0, σ1n, qn, m, α0, β0. Partially following the choices in [21], we use α0 = 2, β0 = 5, and σ0 = 0.1 for our proposed method (BSM-DP). The order of the difference operator is set to m = ⌊5n^{1/5}⌋. Additionally, the variance of the diffusing prior σ1n² is set to diverge with the sample size, and we choose qn = Pr(γi = 1) according to a prespecified value of K. The value of K can be a preliminary guess for the size of the active set; for example, one can use the size of the active set selected by LASSO. In this paper we simply set K = 10. In each of the following cases, we run 6000 iterations and treat the first 3000 as burn-in samples. We report simulation results based on both MAP and MPM for our proposed method.
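The hyperparameter choices above can be collected in a small helper. The formula for m and the values α0 = 2, β0 = 5, σ0 = 0.1, K = 10 follow the text; the slab scale σ1n and the calibration of qn from K are illustrative assumptions, since their exact formulas are not reproduced here.

```python
import numpy as np

def bsm_dp_hyperparameters(n, p, K=10):
    """Hyperparameter choices for the simulation in Section 3.1.1.
    sigma1n and the q_n calibration are assumptions, not the paper's
    exact formulas."""
    alpha0, beta0 = 2.0, 5.0              # Inverse Gamma prior IG(alpha0, beta0)
    sigma0 = 0.1                          # spike standard deviation
    m = int(np.floor(5 * n ** (1 / 5)))   # difference order m = floor(5 n^{1/5})
    sigma1n = float(np.sqrt(n))           # assumed diverging slab scale
    q_n = K / p                           # assumed calibration: prior mean model size K
    return {"alpha0": alpha0, "beta0": beta0, "sigma0": sigma0,
            "m": m, "sigma1n": sigma1n, "q_n": q_n}

hp = bsm_dp_hyperparameters(n=200, p=1000)
```

With n = 200 this gives difference order m = 14 and prior inclusion probability qn = 0.01.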
We fix n = 200, p = 1000, and the true active set with fixed nonzero coefficients. The error ϵi is drawn from the standard normal distribution. The fixed design Ui = i/n, i ∈ {1, …, n}, is used with three different types of X:
Case 1. Type I normal distribution with autoregressive covariance matrix: , where Σij = ρ|i−j|.
Case 2. Type II normal distribution with compound symmetric covariance matrix: , where Σij = 1 for i = j and Σij = ρ for i ≠ j.
Case 3. Type III mixture of normals: X is sampled from with probability 0.9 and from with probability 0.1, where Σ is the compound symmetric correlation matrix with correlation ρ.
In each case, we also consider low and high correlations at ρ = 0.2 and ρ = 0.8 separately, with two choices of the nonparametric component f(U) = U2 and f(U) = sin(2πU). The following evaluation criteria are used for comparing methods based on 500 replications:
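All three covariate designs can be generated row by row without forming a p × p covariance matrix. The sketch below is our illustrative Python, not the authors' code; in particular, the variance-inflation factor for the mixture's minority component is an assumption, since the exact mixture components are elided above.

```python
import math
import random

def ar1_row(p, rho, rng):
    # Case 1: autoregressive correlations Corr(X_i, X_j) = rho^{|i-j|},
    # built recursively so no p-by-p Cholesky factorization is needed.
    x = [rng.gauss(0.0, 1.0)]
    s = math.sqrt(1.0 - rho ** 2)
    for _ in range(p - 1):
        x.append(rho * x[-1] + s * rng.gauss(0.0, 1.0))
    return x

def cs_row(p, rho, rng):
    # Case 2: compound symmetry, Corr(X_i, X_j) = rho for i != j,
    # via one shared factor plus independent noise.
    shared = rng.gauss(0.0, 1.0)
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    return [a * shared + b * rng.gauss(0.0, 1.0) for _ in range(p)]

def mixture_row(p, rho, rng, scale=3.0):
    # Case 3: with probability 0.9 draw from the compound-symmetric normal,
    # with probability 0.1 from a heavier-tailed component; the inflation
    # factor `scale` is an assumption, not a value from the paper.
    row = cs_row(p, rho, rng)
    if rng.random() < 0.1:
        row = [scale * v for v in row]
    return row
```

Each call produces one row Xi; stacking n = 200 such rows gives the design matrix for one replication.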
and : the average of the maximal marginal posterior probability over the truly inactive covariates, and the average of the minimal marginal posterior probability over the truly active covariates, respectively.
: the proportion of replications in which the exact model is selected.
: the proportion of replications in which all true active variables are selected.
pi: the proportion of replications in which the ith true active variable is selected, i ∈ {1, 4}.
TP (true positive): the average number of true active variables selected.
FP (false positive): the average number of selected variables that are actually inactive.
ME (model error): .
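Given the set of indices selected in each replication, most of the criteria above reduce to simple set operations. A minimal sketch (the helper name is ours, not from the paper):

```python
def selection_metrics(selected_sets, true_active):
    """Exact-fit rate, coverage rate, average TP and average FP
    computed from per-replication selected index sets."""
    n_rep = len(selected_sets)
    return {
        # proportion of replications selecting exactly the true model
        "exact_fit": sum(s == true_active for s in selected_sets) / n_rep,
        # proportion of replications selecting all true active variables
        "coverage": sum(true_active <= s for s in selected_sets) / n_rep,
        # average number of true active variables selected
        "TP": sum(len(s & true_active) for s in selected_sets) / n_rep,
        # average number of selected variables that are actually inactive
        "FP": sum(len(s - true_active) for s in selected_sets) / n_rep,
    }
```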
Note that the existing methods first compute partial residuals. Obtaining the partial residuals requires E(Y|U), which is estimated by local linear regression, following [20]; the bandwidth is chosen via the plug-in method using the R package KernSmooth. With LASSO and SCAD, we then estimate β and perform variable selection simultaneously in the second step. For the partial-correlation methods PC and TPC, an estimate of the active set is obtained first, and then β is estimated by regressing the partial residuals on the selected predictors through the least squares method; details can be found in [20]. In this simulation, we use the posterior mean of β as the estimate for the proposed method.
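As a concrete illustration of that first step, partial residuals replace the observed values by the observed values minus the fitted conditional mean given U. The paper's comparisons use local linear regression with a plug-in bandwidth (R package KernSmooth); the sketch below substitutes a simple Gaussian-kernel Nadaraya-Watson smoother as a stand-in, so it illustrates the idea rather than the exact estimator.

```python
import math

def kernel_smooth(u_train, y_train, u_eval, h):
    # Nadaraya-Watson estimate of E(Y | U = u0) with a Gaussian kernel;
    # a stand-in for the local linear smoother used in the paper.
    out = []
    for u0 in u_eval:
        w = [math.exp(-0.5 * ((u - u0) / h) ** 2) for u in u_train]
        total = sum(w)
        out.append(sum(wi * yi for wi, yi in zip(w, y_train)) / total)
    return out

def partial_residuals(u, y, h):
    # Partial residuals: observed values minus the fitted conditional mean.
    fitted = kernel_smooth(u, y, u, h)
    return [yi - fi for yi, fi in zip(y, fitted)]
```

The same smoother is applied columnwise to X before the second-step selection.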
3.1.2. Simulation results
Tables 1–3 record the results averaged over the 500 replications. Case 1 uses the autoregressive covariance matrix with decaying correlations. With low correlation (ρ = 0.2), all methods perform well regardless of the type of the nonparametric function. With high correlation (ρ = 0.8), LASSO is prone to overfit the model, selecting the exact model only around 20% of the time.
Table 1.
Summarized simulation results for Case 1: p = 1000, n = 200 and X is sampled from normal distribution with autoregressive correlation matrix. The reported values are means of different performance measures averaged over 500 replications. The methods compared include LASSO and SCAD on partial residuals tuned by BIC (LASSO.BIC, SCAD.BIC), PC-simple algorithm on the partial residuals (PC-PR), threshold partial correlation on partial residuals (TPC-PR, TPC-PR.EBIC), proposed method with model selected by MAP and MPM (BSM-DP.MAP, BSM-DP.MPM). The details of different methods and measures are provided in Section 3.1.1.
| Method | p1 | p4 | TP | FP | ME | |||||
|---|---|---|---|---|---|---|---|---|---|---|
| LASSO.BIC | 0.760 | 1.000 | 1.000 | 1.000 | 4.000 | 0.256 | 0.313 | |||
| SCAD.BIC | 0.932 | 1.000 | 1.000 | 1.000 | 4.000 | 0.082 | 0.026 | |||
| f(u) = u2 | PC-PR | 0.966 | 0.994 | 0.994 | 1.000 | 3.994 | 0.030 | 0.040 | ||
| ρ = 0.2 | TPC-PR | 0.964 | 0.994 | 0.994 | 1.000 | 3.994 | 0.032 | 0.041 | ||
| TPC-PR.EBIC | 0.994 | 1.000 | 1.000 | 1.000 | 4.000 | 0.010 | 0.027 | |||
| BSM-DP.MAP (new) | 0.073 | 1.000 | 0.974 | 1.000 | 1.000 | 1.000 | 4.000 | 0.032 | 0.026 | |
| BSM-DP.MPM (new) | 0.974 | 1.000 | 1.000 | 1.000 | 4.000 | 0.030 | 0.026 | |||
| LASSO.BIC | 0.738 | 1.000 | 1.000 | 1.000 | 4.000 | 0.288 | 0.334 | |||
| SCAD.BIC | 0.924 | 1.000 | 1.000 | 1.000 | 4.000 | 0.120 | 0.029 | |||
| f(u) = sin(2πu) | PC-PR | 0.996 | 0.994 | 0.994 | 1.000 | 3.994 | 0.032 | 0.045 | ||
| ρ = 0.2 | TPC-PR | 0.958 | 0.994 | 0.994 | 1.000 | 3.994 | 0.040 | 0.045 | ||
| TPC-PR.EBIC | 0.982 | 1.000 | 1.000 | 1.000 | 4.000 | 0.018 | 0.029 | |||
| BSM-DP.MAP (new) | 0.074 | 1.000 | 0.974 | 1.000 | 1.000 | 1.000 | 4.000 | 0.028 | 0.026 | |
| BSM-DP.MPM (new) | 0.970 | 1.000 | 1.000 | 1.000 | 4.000 | 0.032 | 0.026 | |||
| LASSO.BIC | 0.222 | 1.000 | 1.000 | 1.000 | 4.000 | 1.266 | 0.309 | |||
| SCAD.BIC | 0.984 | 0.998 | 0.998 | 1.000 | 3.998 | 0.020 | 0.032 | |||
| f(u) = u2 | PC-PR | 0.644 | 0.652 | 0.768 | 1.000 | 3.652 | 0.094 | 0.344 | ||
| ρ = 0.8 | TPC-PR | 0.652 | 0.660 | 0.774 | 1.000 | 3.660 | 0.092 | 0.338 | ||
| TPC-PR.EBIC | 0.846 | 0.858 | 0.886 | 1.000 | 3.858 | 0.038 | 0.149 | |||
| BSM-DP.MAP (new) | 0.053 | 1.000 | 0.990 | 1.000 | 1.000 | 1.000 | 4.000 | 0.010 | 0.023 | |
| BSM-DP.MPM (new) | 0.994 | 1.000 | 1.000 | 1.000 | 4.000 | 0.006 | 0.023 | |||
| LASSO.BIC | 0.172 | 1.000 | 1.000 | 1.000 | 4.000 | 1.362 | 0.321 | |||
| SCAD.BIC | 0.986 | 0.998 | 0.998 | 1.000 | 3.998 | 0.014 | 0.040 | |||
| f(u) = sin(2πu) | PC-PR | 0.650 | 0.656 | 0.796 | 1.000 | 3.998 | 0.014 | 0.348 | ||
| ρ = 0.8 | TPC-PR | 0.662 | 0.668 | 0.804 | 1.000 | 3.666 | 0.132 | 0.337 | ||
| TPC-PR.EBIC | 0.840 | 0.854 | 0.906 | 1.000 | 3.854 | 0.064 | 0.159 | |||
| BSM-DP.MAP (new) | 0.055 | 1.000 | 0.982 | 1.000 | 1.000 | 1.000 | 4.000 | 0.018 | 0.023 | |
| BSM-DP.MPM (new) | 0.984 | 1.000 | 1.000 | 1.000 | 4.000 | 0.016 | 0.023 |
Table 3.
Summarized simulation results for Case 3: p = 1000, n = 200 and X is sampled from mixture of normals with compound symmetric matrix. The reported values are means of different performance measures averaged over 500 replications. The methods compared include LASSO and SCAD on partial residuals tuned by BIC (LASSO.BIC, SCAD.BIC), PC-simple algorithm on the partial residuals (PC-PR), threshold partial correlation on partial residuals (TPC-PR, TPC-PR.EBIC), proposed method with model selected by MAP and MPM (BSM-DP.MAP, BSM-DP.MPM). The details of different methods and measures are provided in Section 3.1.1. Results under high correlation ρ = 0.8 are highlighted.
| Method | p1 | p4 | TP | FP | ME | |||||
|---|---|---|---|---|---|---|---|---|---|---|
| LASSO.BIC | 0.156 | 1.000 | 1.000 | 1.000 | 4.000 | 2.434 | 0.475 | |||
| SCAD.BIC | 0.994 | 1.000 | 1.000 | 1.000 | 4.000 | 0.008 | 0.022 | |||
| f(u) = u2 | PC-PR | 0.476 | 0.690 | 0.724 | 1.000 | 3.676 | 1.304 | 0.927 | ||
| ρ = 0.2 | TPC-PR | 0.574 | 0.574 | 0.636 | 1.000 | 3.552 | 0.622 | 1.252 | ||
| TPC-PR.EBIC | 0.680 | 0.694 | 0.726 | 1.000 | 3.680 | 0.612 | 0.891 | |||
| BSM-DP.MAP (new) | 0.046 | 1.000 | 0.984 | 1.000 | 1.000 | 1.000 | 4.000 | 0.016 | 0.015 | |
| BSM-DP.MPM (new) | 0.986 | 1.000 | 1.000 | 1.000 | 4.000 | 0.014 | 0.015 | |||
| LASSO.BIC | 0.228 | 1.000 | 1.000 | 1.000 | 4.000 | 2.194 | 0.449 | |||
| SCAD.BIC | 0.990 | 1.000 | 1.000 | 1.000 | 4.000 | 0.012 | 0.018 | |||
| f(u) = sin(2πu) | PC-PR | 0.572 | 0.768 | 0.802 | 1.000 | 3.776 | 0.982 | 0.670 | ||
| ρ = 0.2 | TPC-PR | 0.618 | 0.622 | 0.682 | 1.000 | 3.602 | 0.544 | 1.125 | ||
| TPC-PR.EBIC | 0.748 | 0.768 | 0.804 | 1.000 | 3.764 | 0.438 | 0.667 | |||
| BSM-DP.MAP (new) | 0.050 | 1.000 | 0.986 | 1.000 | 1.000 | 1.000 | 4.000 | 0.016 | 0.014 | |
| BSM-DP.MPM (new) | 0.988 | 1.000 | 1.000 | 1.000 | 4.000 | 0.012 | 0.014 | |||
| LASSO.BIC | 0.000 | 0.998 | 0.998 | 1.000 | 3.998 | 14.060 | 0.395 | |||
| SCAD.BIC | 0.380 | 0.382 | 0.432 | 1.000 | 3.374 | 0.120 | 0.652 | |||
| f(u) = u2 | PC-PR | *** | *** | *** | *** | *** | *** | *** | ||
| ρ = 0.8 | TPC-PR | *** | *** | *** | *** | *** | *** | *** | ||
| TPC-PR.EBIC | *** | *** | *** | *** | *** | *** | *** | |||
| BSM-DP.MAP (new) | 0.092 | 1.000 | 0.962 | 1.000 | 1.000 | 1.000 | 4.000 | 0.042 | 0.025 | |
| BSM-DP.MPM (new) | 0.964 | 1.000 | 1.000 | 1.000 | 4.000 | 0.040 | 0.025 | |||
| LASSO.BIC | 0.000 | 0.996 | 0.996 | 1.000 | 3.996 | 14.002 | 0.418 | |||
| SCAD.BIC | 0.422 | 0.426 | 0.492 | 1.000 | 3.416 | 0.022 | 0.672 | |||
| f(u) = sin(2πu) | PC-PR | *** | *** | *** | *** | *** | *** | *** | ||
| ρ = 0.8 | TPC-PR | *** | *** | *** | *** | *** | *** | *** | ||
| TPC-PR.EBIC | *** | *** | *** | *** | *** | *** | *** | |||
| BSM-DP.MAP (new) | 0.082 | 1.000 | 0.960 | 1.000 | 1.000 | 1.000 | 4.000 | 0.042 | 0.020 | |
| BSM-DP.MPM (new) | 0.970 | 1.000 | 1.000 | 1.000 | 4.000 | 0.032 | 0.020 |
*** indicates cases in which a single replication takes more than 48 hours, so 500 replications could not be completed in a timely manner.
Identifying the true covariates is more challenging under dense correlation, i.e., Case 2 with the compound symmetric covariance matrix. The exact-fit rates are much lower for most methods than in Case 1. Notably, under high correlation (ρ = 0.8), most methods other than our proposed BSM-DP perform poorly. LASSO consistently selects a larger model, with about 16 spurious covariates on average, while SCAD tuned by BIC is prone to select a smaller model. PC and TPC are evidently slow under dense correlation: when ρ = 0.2, it takes around 12 hours for PC, TPC and TPC-EBIC to finish one replication, and more than 48 hours are needed when ρ = 0.8, so we mark those entries with stars (***), since 500 replications could not be completed in a timely manner. Our proposed method (BSM-DP) gives the best results, with exact-fit rates above 95% even in the high, dense correlation setting.
In Case 3, X is generated from a mixture of normals, which has heavier tails than the normal distribution. Since PC-PR relies heavily on the normality of the covariates, it performs poorly. The updated version, TPC, which does not assume normality, shows improvement. Our proposed method (BSM-DP) still stands out in the comparison, with exact-fit rates around 95%.
Overall, as the correlation increases, LASSO tuned by BIC tends to overfit the model, while SCAD tuned by BIC is more likely to select a smaller model. When X is normally distributed, PC and TPC perform similarly, but when the normality assumption is violated, TPC performs better than PC. The newly proposed BSM-DP consistently performs best, regardless of the correlation strength and the distribution of X, with exact-fit rates above 90% in all cases.
Remark 7. (On the model selection procedure) In our simulation study, the models selected by MAP and MPM are very similar. MAP requires no threshold; with MPM, the jth variable is selected if its posterior inclusion probability Pr(γj = 1|Y) ≥ 0.5. To investigate the impact of thresholds other than 0.5, we further explore the simulation setting of Case 2 with ρ = 0.8 and consider threshold values ranging from 0 to 1; the results are presented in Fig. 2. With a smaller threshold, more spurious variables are likely to enter the model, so the false discovery rate (dotted red line) is higher, while a larger threshold corresponds to a more stringent selection criterion and a higher chance of missing active variables. Note that in the setting shown in Fig. 2(a), the true positive rate (dashed green line) stays high because all active variables have marginal inclusion probabilities equal to 1; in general, however, we expect a drop as the threshold approaches 1, as in Fig. 2(b) with the lower signal . The exact-fit rate (solid blue line), the proportion of replications in which the exact model is selected, remains reasonably good as long as the threshold is neither too large nor too small. Overall, 0.5 appears to be a good choice.
Fig. 2.

Impact of the threshold value for the posterior inclusion probability on model selection performance. The criteria displayed are the exact-fit rate (solid blue line), false discovery rate (dotted red line) and true positive rate (dashed green line). A range of threshold values from 0 to 1 is used to plot the curve for each criterion. Two signal strengths are considered: and .
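The MPM rule in Remark 7 is straightforward to implement from the post-burn-in Gibbs draws of γ. A minimal sketch (the function names are ours):

```python
def inclusion_probs(gamma_draws):
    # Marginal posterior inclusion probabilities Pr(gamma_j = 1 | Y),
    # estimated as the average of 0/1 indicators over post-burn-in draws.
    n_draws, p = len(gamma_draws), len(gamma_draws[0])
    return [sum(draw[j] for draw in gamma_draws) / n_draws for j in range(p)]

def mpm_select(probs, threshold=0.5):
    # Median probability model: keep variable j iff its posterior
    # inclusion probability is at least the threshold (0.5 by default).
    return {j for j, pj in enumerate(probs) if pj >= threshold}
```

Varying `threshold` over (0, 1) reproduces the trade-off displayed in Fig. 2.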
We also compare the computation time required by the different methods. Among the five methods above, LASSO and SCAD are the fastest, taking about 2 minutes in R per replication. PC and TPC are also fast when X has a covariance matrix with decaying correlations, but they become dramatically slow under high, dense correlation: more than 48 hours are needed for PC in R to finish one replication when the covariance of X is compound symmetric with ρ = 0.8. The situation may worsen with higher-dimensional covariates and larger active sets, since the computational time of both PC and TPC grows polynomially in them. Times were recorded on a MacBook Pro (early 2015) with a 2.7 GHz Intel i5 and 8 GB of RAM.
In contrast, the computational burden of the newly proposed Bayesian method (BSM-DP) is moderate: among all simulation settings, the slowest takes about 12 minutes to finish 6000 iterations for one replication using Julia 0.6. As discussed in Remark 1 of Section 2.1, Bayesian subset modeling is scalable, and the computational time grows only approximately linearly with the dimension of the covariates. Based on the estimation procedure, the computational complexity of each iteration is n(p ∨ |γ|2 ∨ n2), where |γ| is the current active model size. To explore how the computation time changes for different (and especially higher) dimensions of covariates, we record the CPU time to finish 6000 iterations for p ranging from 100 to 4000 in the simulation setting of Case 1 with ρ = 0.8; the result is presented in Fig. 3. The computation time increases nearly linearly with p. It is not perfectly linear because the number of iterations until convergence tends to grow with p: for small p (e.g., p < 500) it usually takes only a few iterations to converge to a small |γ|, but more iterations are often required as p grows. One large jump appears in the plot, which might relate to hardware caching limits, especially when p is large.
Fig. 3.

Change in computational time (in minutes) when the dimension of covariates increases from 100 to 4000. The CPU time is estimated by the median computation time consumed among 10 replications for each dimension setting.
3.2. A real data example — supermarket data analysis
In this section, the proposed method is applied to analyze a supermarket data set studied in [6, 20, 26]. The data set contains n = 464 daily records of the number of customers, which is the response variable, and the sales of p = 6398 products, which are the predictors. Both the response and the predictors are standardized to have zero mean and unit variance. The plot on the left of Fig. 4 shows the number of customers over the days. The periodicity of Y is obvious, so it is reasonable to model Y with a PLM that takes the time variation into account; a covariate Ui = i/n is introduced to represent time. To check the correlation among predictors, we plot the histogram of sampled pairwise correlations in Fig. 4, which shows some moderate correlation. We randomly select 75% of the observations (Xi, Yi, Ui), i ∈ {1, …, 464}, as the training set and keep the remaining 25% as the testing set. The PLM is fitted and variables are selected on the training set, and the mean squared errors of the selected model on the testing set are computed to evaluate the model fit. This procedure is repeated 100 times; Table 4 summarizes the average size of the selected models and the mean squared errors.
Fig. 4.

Visualizations to display features of the (standardized) supermarket data set. The plot on the left gives the trend of daily number of customers entering a supermarket for 464 days. The histogram on the right describes the distribution of the correlation among sampled predictors.
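The repeated random-split evaluation just described can be sketched as follows (illustrative Python; the function name is ours):

```python
import random

def train_test_split_indices(n, train_frac=0.75, seed=None):
    # One random split of {0, ..., n-1} into 75% training / 25% testing
    # rows, as used for the supermarket data; calling this with 100
    # different seeds gives the 100 replications summarized in Table 4.
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(round(train_frac * n))
    return idx[:cut], idx[cut:]
```

For each split, the model is fitted on the training rows and its mean squared error is recorded on the held-out rows; the reported values are averages over the splits.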
Table 4.
Comparisons of the resulting model size and the mean squared errors for different methods. The values in the table are means with the corresponding standard errors (in parentheses) over the 100 replications. The methods compared are LASSO and SCAD on partial residuals tuned by BIC (SIS-LASSO.BIC, SIS-SCAD.BIC), the PC-simple algorithm on the partial residuals (SIS-PC-PR), threshold partial correlation on partial residuals (SIS-TPC-PR, SIS-TPC-PR.EBIC), and the proposed method with the model selected by MPM and tuned by EBIC (SIS-BSM-DP.EBIC).
| Method | Model Size (s.e.) | MSE on training set (s.e.) | MSE on testing set (s.e.) |
|---|---|---|---|
| SIS-LASSO.BIC | 28.80 (3.32) | 0.0575 (0.0030) | 0.0836 (0.0120) |
| SIS-SCAD.BIC | 15.75 (5.44) | 0.0647 (0.0062) | 0.0907 (0.0143) |
| SIS-PC-PR | 12.94 (1.03) | 0.0497 (0.0031) | 0.0847 (0.0112) |
| SIS-TPC-PR | 9.81 (0.92) | 0.0540 (0.0034) | 0.0860 (0.0109) |
| SIS-TPC-PR.EBIC | 8.50 (0.98) | 0.0559 (0.0038) | 0.0867 (0.0117) |
| SIS-BSM-DP.EBIC (new) | 7.95 (1.91) | 0.0610 (0.0055) | 0.0864 (0.0122) |
To reduce the dimension to a moderate scale, we first apply SIS [9] to the partial residuals and keep only the top 2000 predictors as the set subjected to variable selection. We also implement LASSO.BIC, SCAD.BIC, PC-PR, TPC-PR and TPC-PR.EBIC on the same data set for comparison. The hyperparameters for BSM-DP are chosen as in the simulation set-up. We complete 10000 iterations, with the first 6000 as burn-in samples, and rank covariates by their marginal posterior probabilities Pr(γj = 1|Y). The candidate set is further selected by EBIC.
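The SIS screening step ranks predictors by the absolute marginal correlation between each column of X and the partial residuals, keeping the top 2000. A self-contained sketch (our illustration of the idea in [9], not the authors' code):

```python
import math

def _corr(a, b):
    # Pearson correlation of two equal-length sequences.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sa = math.sqrt(sum((v - ma) ** 2 for v in a))
    sb = math.sqrt(sum((v - mb) ** 2 for v in b))
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (sa * sb)

def sis_screen(x_columns, resid, keep=2000):
    # Keep the `keep` predictors most marginally correlated (in absolute
    # value) with the partial residuals of the response.
    ranked = sorted(range(len(x_columns)),
                    key=lambda j: -abs(_corr(x_columns[j], resid)))
    return set(ranked[:keep])
```

Only the surviving predictors enter the subsequent variable-selection step.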
After obtaining partial residuals, LASSO and SCAD allow us to estimate and select variables simultaneously. For PC, TPC, and BSM-DP, an estimate of the active set is obtained first, and then is estimated by regressing the partial residuals on through OLS. Since our theoretical properties are derived for γ, we use BSM-DP only to select the set of active covariates; is likewise obtained by regressing the partial residuals on the selected covariate set.
As shown in Table 4, LASSO selects the largest models, with an average size of 28.80; moreover, its MSE is much smaller on the training set than on the testing set, which suggests that LASSO may be overfitting. SCAD selects smaller models, of size 15.75 on average, but has a larger error on the testing set. PC, TPC, TPC.EBIC and the newly proposed BSM-DP all select smaller models. Among all methods, our proposed BSM-DP selects the smallest number of covariates while attaining a very similar mean squared error on the testing data set. Fig. 5 illustrates a comparison of the estimated nonparametric functions obtained by the different methods.
Fig. 5.

The estimates of the nonparametric function for the supermarket data set by different methods including: LASSO and SCAD on partial residuals tuned by BIC (SIS-LASSO.BIC, SIS-SCAD.BIC), PC-simple algorithm on the partial residuals (SIS-PC-PR), threshold partial correlation on partial residuals (SIS-TPC-PR, SIS-TPC-PR.EBIC), and proposed method with model selected by MPM tuned by EBIC (SIS-BSM-DP.EBIC).
4. Technical proofs
This section includes the technical proofs of Lemmas 1–3 and Theorems 4 and 5.
Proof of Lemma 1. Note that qn → 0, as stated in Condition A. We first write out the posterior distribution of the parameters as
We first integrate out α, and it follows that
Denote . It follows by integrating out β that
| (5) |
where
Let X* = C1/2DX, , . As , for any model , is asymptotically equal to , where is the subset of obtained by taking the last n − m elements. By Conditions C and E, it can be shown that and . We now bound . For any model :
Thus, , so
□
Proof of Lemma 2. By Condition A, if Σ0n is taken to be Σ0n = {(In − CD⊤D)− 1 – In}+, since Σ1n is defined as , so we have
where is the difference matrix defined in (2) and 0 < C ≤ min{1, 1/λmax(D⊤D)} is a constant, thus Σαn = In − Σ1n and are semi-positive definite. By taking difference operation on each side, we have
where , , , . The projection matrix is defined as and furthermore we denote . Under the true model, we have , then for any model containing the true model ,
It suffices to show a.s., a.s. and a.s..
Step 1: To show that a.s.: note that
where . Thus by , , we have
By Cauchy-Schwartz inequality and Condition F, for any f (·) ∈ Λk(M),
By Conditions D, E, F, 1 + (c2 − c1)2(k ∧ 1) < 0, it can be shown that . Then a.s..
Step 2: To show that a.s.: under fixed design, we have
Since DD⊤ → In−m and a.s. from Step 1, so a.s.
Step 3: To show that a.s.: by Condition E, . Let , then ω = C1/2D∈ = C1/2J∈ + C1/2(D − J)∈ = ω1 + ω2. Since D → J, so ω2 is negligible as compared to ω1 as n goes to infinity. Furthermore , therefore a.s..
Overall we have a.s.. Similarly, write . It can be proven that the second and third terms are almost surely 0, so . □
Proof of Lemma 3. We first prove part (i). Note that , , thus
where the first equality is due to the Woodbury matrix identity A−1 − (A + UCV)−1 = A−1U(C−1 + VA−1U)−1 VA−1. Denote , which has rank and . By [13], we can derive the tail bound for the quadratic term:
We next prove part (ii). By Lemma 2, , by the tail bound for χ2 distribution in [17], for any positive x, we have
Furthermore, since m = o(n), thus for any fixed ∈ > 0, there exists a constant c > 0, such that
□
Proof of Theorem 4. We will use a strategy related to that of Theorem 4.1 in [21] to establish Theorem 4.
For overfitted models , we first bound . By Lemma 2, for , we have . For any x > 0, , there exists a constant c > 0 such that
Define events , .
Then for any fixed s > 0, there exists some c, c′ > 0, such that
where the first inequality is due to the fact that . By Lemma 3 and Condition A, it follows that
Then,
So consider the high probability event , we have
As 0 < C < 1, set 0 < s < δ/2, then there exists c > 0 such that
Next consider large models that miss some active variables . Define the events
Similar to the proof for , there exists c′ > 0, such that
Taking s = δ/4, we can find such a w′ > 0 as long as . That is, K > 1 + 4/δ, as stated in Condition B. It follows that
Then consider the high probability event ,
For any model belonging to the group of underfitted models , it follows that
By Condition B,
And on the other hand, . Among them, . By the similar trick in Lemma 2, it can be shown a.s..
For any 0 < w < 1,
The last step follows by the bound for tail with quadratic form. For any 0 < w′ < 1,
where the first inequality is due to the fact that , and the last inequality follows by the exponential tails of . The proof is similar to Lemma 2 (1).
Let c = 2w, it follows that
where the second inequality holds because , and the last step follows by Condition B. Then consider the high probability event ,
By Condition B, , so, we can find 0 < c < 1, w′ > 0, such that . Thus,
□
Proof of Theorem 5. Similar to Theorem 4.2 in [21], we start from (5) and integrate out σ2, thus the posterior probability under unknown σ2 is
Remark 8. Inside the integral, is the dominant term, as the sum of squared residuals has order Op(n). The theorem applies to a wider family of priors as long as π(σ2) is Op(1) and the support is not too irregular. This includes commonly used priors such as the improper non-informative prior π(σ2) ∝ σ−2 and the class of folded noncentral-t priors with fixed hyper-parameters .
By Lemma 1, we have
Define , then , since
First consider overfitted models . Define , tn = − ln {1 − 2(1 + 4s)xn}. Then by Condition D, . Since , then for any s < δ/16, we have
Similar to the proof in Theorem 4 (1), consider the high probability event , where (1 +4s)(1 − ∈*) > (1 + 2s), then
The problem reduces to the same problem in Theorem 4 (1). And for , we can use the same trick, thus we have
For underfitted models , similar to the proof in Theorem 4 (3), consider the high probability event .
If Δn(K) = o(n), by limn→∞(1 + 1/n)n = e,
It reduces to the same problem in Theorem 4 (3).
While if Δn(K) ⪰ n, it follows that
which converges even faster to 0 as n → ∞. □
5. Discussion
Inspired by the difference-based method, we have proposed a new Bayesian approach for selecting variables in the linear component of the partially linear model. We modify the Bayesian shrinking and diffusing priors (BASAD) [21] and propose the new Bayesian subset modeling with diffusing prior (BSM-DP); the idea is extended from linear models to the partially linear model with the help of the difference-based method. Model selection consistency is proved in the setting with ultrahigh dimensional covariates. Compared to BASAD, BSM-DP performs better at identifying low-signal covariates and requires shorter computation time, as shown in the supplementary material. The simulation studies show that, compared with other existing methods for variable selection in PLM, our method tolerates higher correlation among predictors and requires milder conditions on the covariates. The proposed method is also less likely to overfit, as illustrated by the supermarket data example. However, like other Bayesian methods, it comes at some cost: we need specific assumptions on the error distribution, and the computation is relatively intensive compared to frequentist penalized methods. Finally, as with frequentist methods, although we derived the required rates for the hyperparameters of the priors, their practical choice in finite-sample applications still requires fine tuning.
Supplementary Material
Table 2.
Summarized simulation results for Case 2: p = 1000, n = 200 and X is sampled from normal distribution with compound symmetric correlation matrix. The reported values are means of different performance measures averaged over 500 replications. The methods compared include LASSO and SCAD on partial residuals tuned by BIC (LASSO.BIC, SCAD.BIC), PC-simple algorithm on the partial residuals (PC-PR), threshold partial correlation on partial residuals (TPC-PR, TPC-PR.EBIC), proposed method with model selected by MAP and MPM (BSM-DP.MAP, BSM-DP.MPM). The details of different methods and measures are provided in Section 3.1.1. Results under high correlation ρ = 0.8 are highlighted.
| Method | p1 | p4 | TP | FP | ME | |||||
|---|---|---|---|---|---|---|---|---|---|---|
| LASSO.BIC | 0.220 | 1.000 | 1.000 | 1.000 | 4.000 | 1.652 | 0.509 | |||
| SCAD.BIC | 0.938 | 1.000 | 1.000 | 1.000 | 4.000 | 0.070 | 0.028 | |||
| f(u) = u2 | PC-PR | 0.420 | 0.998 | 0.998 | 1.000 | 3.998 | 0.766 | 0.063 | ||
| ρ = 0.2 | TPC-PR | 0.406 | 0.998 | 0.998 | 1.000 | 3.998 | 0.782 | 0.063 | ||
| TPC-PR.EBIC | 0.862 | 0.998 | 0.998 | 1.000 | 3.998 | 0.140 | 0.040 | |||
| BSM-DP.MAP (new) | 0.077 | 1.000 | 0.970 | 1.000 | 1.000 | 1.000 | 4.000 | 0.030 | 0.024 | |
| BSM-DP.MPM (new) | 0.970 | 1.000 | 1.000 | 1.000 | 4.000 | 0.024 | 0.024 | |||
| LASSO.BIC | 0.202 | 1.000 | 1.000 | 1.000 | 4.000 | 1.746 | 0.515 | |||
| SCAD.BIC | 0.950 | 1.000 | 1.000 | 1.000 | 4.000 | 0.054 | 0.027 | |||
| f(u) = sin(2πu) | PC-PR | 0.368 | 0.984 | 0.984 | 1.000 | 3.984 | 0.870 | 0.098 | ||
| ρ = 0.2 | TPC-PR | 0.358 | 0.986 | 0.986 | 1.000 | 3.986 | 0.884 | 0.095 | ||
| TPC-PR.EBIC | 0.826 | 0.990 | 0.990 | 1.000 | 3.990 | 0.208 | 0.060 | |||
| BSM-DP.MAP (new) | 0.087 | 1.000 | 0.970 | 1.000 | 1.000 | 1.000 | 4.000 | 0.036 | 0.032 | |
| BSM-DP.MPM (new) | 0.976 | 1.000 | 1.000 | 1.000 | 4.000 | 0.026 | 0.033 | |||
| LASSO.BIC | 0.000 | 1.000 | 1.000 | 1.000 | 4.000 | 16.174 | 0.390 | |||
| SCAD.BIC | 0.386 | 0.386 | 0.412 | 1.000 | 3.386 | 0.000 | 0.585 | |||
| f(u) = u2 | PC-PR | *** | *** | *** | *** | *** | *** | *** | ||
| ρ = 0.8 | TPC-PR | *** | *** | *** | *** | *** | *** | *** | ||
| TPC-PR.EBIC | *** | *** | *** | *** | *** | *** | *** | |||
| BSM-DP.MAP (new) | 0.105 | 1.000 | 0.942 | 1.000 | 1.000 | 1.000 | 4.000 | 0.068 | 0.034 | |
| BSM-DP.MPM (new) | 0.956 | 1.000 | 1.000 | 1.000 | 4.000 | 0.048 | 0.037 | |||
| LASSO.BIC | 0.000 | 1.000 | 1.000 | 1.000 | 4.000 | 16.114 | 0.401 | |||
| SCAD.BIC | 0.426 | 0.426 | 0.448 | 1.000 | 3.426 | 0.000 | 0.574 | |||
| f(u) = sin(2πu) | PC-PR | *** | *** | *** | *** | *** | *** | *** | ||
| ρ = 0.8 | TPC-PR | *** | *** | *** | *** | *** | *** | *** | ||
| TPC-PR.EBIC | *** | *** | *** | *** | *** | *** | *** | |||
| BSM-DP.MAP (new) | 0.093 | 1.000 | 0.970 | 1.000 | 1.000 | 1.000 | 4.000 | 0.036 | 0.032 | |
| BSM-DP.MPM (new) | 0.976 | 1.000 | 1.000 | 1.000 | 4.000 | 0.026 | 0.033 | | | |
*** indicates cases in which a single replication takes more than 48 hours, so 500 replications could not be completed in a timely manner.
Acknowledgments
The authors are grateful to the Editor-in-Chief, an Associate Editor and the referees for comments and suggestions that led to significant improvements. This research was supported by NSF grants DMS 1820702, DMS 1953196, DMS 2015539 and NIH grants R01CA229542 and R01 ES019672. The content is solely the responsibility of the authors and does not necessarily represent the official views of NSF and NIH.
References
- [1]. Barbieri MM, Berger JO, Optimal predictive model selection, Ann. Statist. 32 (2004) 870–897.
- [2]. Breiman L, Better subset regression using the nonnegative garrote, Technometrics 37 (1995) 373–384.
- [3]. Bühlmann P, Kalisch M, Maathuis MH, Variable selection in high-dimensional linear models: partially faithful distributions and the PC-simple algorithm, Biometrika 97 (2010) 261–278.
- [4]. Chen H, Shiau J-JH, A two-stage spline smoothing method for partially linear models, J. Statist. Plann. Inference 27 (1991) 187–201.
- [5]. Chen J, Chen Z, Extended BIC for small-n-large-P sparse GLM, Statist. Sinica 22 (2012) 555–574.
- [6]. Chen Z, Fan J, Li R, Error variance estimation in ultrahigh-dimensional additive models, J. Amer. Statist. Assoc. 113 (2018) 315–327.
- [7]. Engle RF, Granger CWJ, Rice J, Weiss A, Semiparametric estimates of the relation between weather and electricity sales, J. Amer. Statist. Assoc. 81 (1986) 310–320.
- [8]. Fan J, Li R, Variable selection via nonconcave penalized likelihood and its oracle properties, J. Amer. Statist. Assoc. 96 (2001) 1348–1360.
- [9]. Fan J, Lv J, Sure independence screening for ultrahigh dimensional feature space, J. R. Stat. Soc. Ser. B Stat. Methodol. 70 (2008) 849–911.
- [10]. George EI, Foster DP, Calibration and empirical Bayes variable selection, Biometrika 87 (2000) 731–747.
- [11]. George EI, McCulloch RE, Variable selection via Gibbs sampling, J. Amer. Statist. Assoc. 88 (1993) 881–889.
- [12]. Heckman NE, Spline smoothing in a partly linear model, J. R. Stat. Soc. Ser. B Methodol. 48 (1986) 244–248.
- [13]. Hsu D, Kakade S, Zhang T, A tail inequality for quadratic forms of subgaussian random vectors, Electron. Commun. Probab. 17 (2012) 1–6.
- [14]. Ishwaran H, Rao JS, Spike and slab variable selection: frequentist and Bayesian strategies, Ann. Statist. 33 (2005) 730–773.
- [15]. Ishwaran H, Rao JS, Consistency of spike and slab regression, Statist. Probab. Lett. 81 (2011) 1920–1928.
- [16]. Johnson VE, Rossell D, Bayesian model selection in high-dimensional settings, J. Amer. Statist. Assoc. 107 (2012) 649–660.
- [17]. Laurent B, Massart P, Adaptive estimation of a quadratic functional by model selection, Ann. Statist. 28 (2000) 1302–1338.
- [18]. Liang F, Song Q, Yu K, Bayesian subset modeling for high-dimensional generalized linear models, J. Amer. Statist. Assoc. 108 (2013) 589–606.
- [19]. Liang H, Li R, Variable selection for partially linear models with measurement errors, J. Amer. Statist. Assoc. 104 (2009) 234–248.
- [20]. Liu J, Lou L, Li R, Variable selection for partially linear models via partial correlation, J. Multivariate Anal. 167 (2018) 418–434.
- [21]. Narisetty NN, He X, Bayesian variable selection with shrinking and diffusing priors, Ann. Statist. 42 (2014) 789–817.
- [22]. Narisetty NN, Shen J, He X, Skinny Gibbs: a consistent and scalable Gibbs sampler for model selection, J. Amer. Statist. Assoc. 114 (2019) 1205–1217.
- [23]. Rice J, Convergence rates for partially splined models, Statist. Probab. Lett. 4 (1986) 203–208.
- [24]. Robinson PM, Root-n-consistent semiparametric regression, Econometrica 56 (1988) 931–954.
- [25]. Tibshirani R, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B Methodol. 58 (1996) 267–288.
- [26]. Wang H, Forward regression for ultra-high dimensional variable screening, J. Amer. Statist. Assoc. 104 (2009) 1512–1524.
- [27]. Wang L, Brown LD, Cai TT, A difference based approach to the semiparametric partial linear model, Electron. J. Stat. 5 (2011) 619–641.
- [28]. Xie H, Huang J, SCAD-penalized regression in high-dimensional partially linear models, Ann. Statist. 37 (2009) 673–696.
- [29]. Yatchew A, An elementary estimator of the partial linear model, Econ. Lett. 57 (1997) 135–143.
- [30]. Yuan M, Lin Y, Efficient empirical Bayes variable selection and estimation in linear models, J. Amer. Statist. Assoc. 100 (2005) 1215–1225.
- [31]. Yuan M, Lin Y, On the non-negative garrotte estimator, J. R. Stat. Soc. Ser. B Stat. Methodol. 69 (2007) 143–161.
- [32].Zhang C-H, Nearly unbiased variable selection under minimax concave penalty, Ann. Statist 38 (2010) 894–942. [Google Scholar]
- [33].Zhu L, Li R, Cui H, Robust estimation for partially linear models with large-dimensional covariates, Science China Mathematics 56 (2013) 2069–2088. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [34].Zou H, Hastie T, Regularization and variable selection via the elastic net, J. R. Stat. Soc. Ser. B Stat. Methodol 67 (2005) 301–320 [Google Scholar]