Significance
Best-subset selection is a benchmark optimization problem in statistics and machine learning. Although many optimization strategies and algorithms have been proposed to solve this problem, our splicing algorithm, under reasonable conditions, enjoys the following properties simultaneously with high probability: 1) its computational complexity is polynomial; 2) it can recover the true subset; and 3) its solution is globally optimal.
Keywords: best-subset selection, splicing, high-dimensional data
Abstract
Best-subset selection aims to find a small subset of predictors, so that the resulting linear model is expected to have the most desirable prediction accuracy. It is not only important and imperative in regression analysis but also has far-reaching applications in every facet of research, including computer science and medicine. We introduce a polynomial algorithm, which, under mild conditions, solves the problem. This algorithm exploits the idea of sequencing and splicing to reach a stable solution in finite steps when the sparsity level of the model is fixed but unknown. We define an information criterion that helps the algorithm select the true sparsity level with a high probability. We show that when the algorithm produces a stable optimal solution, that solution is the oracle estimator of the true parameters with probability one. We also demonstrate the power of the algorithm in several numerical studies.
Subset selection is a classic topic of model selection in statistical learning and is encountered whenever we are interested in understanding the relationship between a response and a set of explanatory variables. Naturally, this problem has been pursued in statistics and mathematics for decades. The classic methods that are commonly described in statistical textbooks include stepwise regression with the Akaike information criterion (AIC) (1), the Bayesian information criterion (BIC) (2), and Mallows's $C_p$ (3).
Consider $n$ independent observations $\{(\boldsymbol{x}_i, y_i),\ i = 1, \ldots, n\}$, where $y_i \in \mathbb{R}$ and $\boldsymbol{x}_i \in \mathbb{R}^p$. Let $\boldsymbol{y} = (y_1, \ldots, y_n)^\top$ and $X = (\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)^\top$. For convenience, we centralize the columns of $X$ to have zero mean. The following is the classic multivariable linear model with regression coefficient vector $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)^\top$ and error vector $\boldsymbol{\varepsilon} = (\varepsilon_1, \ldots, \varepsilon_n)^\top$:
$$\boldsymbol{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}. \tag{1}$$
Parsimony is desired when we consider a subset of the explanatory variables in Model 1 with comparable prediction accuracy. When the regression coefficient vector $\boldsymbol{\beta}$ is sparse, we want to identify the subset of nonzero coefficients. This is the commonly known best-subset selection problem, which minimizes an empirical risk function, e.g., the residual sum of squares, under a cardinality constraint in Model 1,
$$\min_{\boldsymbol{\beta} \in \mathbb{R}^p} \frac{1}{2n}\|\boldsymbol{y} - X\boldsymbol{\beta}\|_2^2, \quad \text{subject to } \|\boldsymbol{\beta}\|_0 \leq s, \tag{2}$$
where $\|\boldsymbol{\beta}\|_0 = \sum_{j=1}^{p} I(\beta_j \neq 0)$ is the $\ell_0$ norm of $\boldsymbol{\beta}$, and the sparsity level $s$ is usually an unknown nonnegative integer.
The Lagrangian form of Eq. 2 represents a balance between goodness of fit and parsimony. The latter is characterized by model complexity, generally defined as an increasing function of the number of nonzero coefficients. Thus, this Lagrangian is not continuous and, of course, not smooth. Greedy methods are usually applied to solve such a Lagrangian but suffer from computational difficulties even for a reasonably large $p$. Alternatively, relaxation methods, e.g., the least absolute shrinkage and selection operator (LASSO) (4), the adaptive LASSO (5), the smoothly clipped absolute deviation penalty (SCAD) (6), and the minimax concave penalty (MCP) (7), have been proposed and investigated to ameliorate the computational issue by replacing the discontinuous $\ell_0$ penalty with a continuous surrogate. These recently developed methods are computationally feasible and provide near-optimal solutions even for large $p$. However, their solutions do not coincide with the best subset and are known to lack important statistical properties (8).
There had been little progress on best-subset selection until recently because such a nonsmooth optimization problem is, in general, nondeterministic polynomial-time hard (NP-hard) (9). Recently, to make the best-subset selection problem computationally tractable, several optimization strategies and algorithms have been proposed, including the iterative hard thresholding (IHT) algorithm (10), primal-dual active set (PDAS) methods (11), and the mixed integer optimization (MIO) approach (12). However, their solutions may converge to a local minimizer, and IHT and PDAS may also cycle periodically between candidate solutions. More importantly, these methods do not determine the sparsity level adaptively, and their statistical properties remain unclear.
In this paper, we directly address Eq. 2 and solve the best-subset selection problem with two critical ingredients: a splicing algorithm and an information criterion. Our contribution is threefold. First, we propose "splicing," a technique to improve the quality of subset selection, and derive an efficient iterative algorithm based on splicing, Adaptive Best-Subset Selection (ABESS), to tackle problem 2. The ABESS algorithm can analyze high-dimensional datasets with tens of thousands of observations and variables. Second, we prove that the ABESS algorithm consistently selects important variables and that its computational complexity is polynomial; that is, the algorithm is rigorously shown to solve problem 2 in polynomial time. Finally, to determine the most suitable sparsity level, we design an information criterion, the special information criterion (SIC), whose best-subset selection consistency is rigorously proven.
We define some useful notation for the content below. For $\boldsymbol{t} = (t_1, \ldots, t_p)^\top \in \mathbb{R}^p$, we define the $\ell_q$ norm of $\boldsymbol{t}$ by $\|\boldsymbol{t}\|_q = \big(\sum_{j=1}^{p} |t_j|^q\big)^{1/q}$, where $q \in [1, \infty)$. Let $\mathcal{S} = \{1, \ldots, p\}$; for any set $\mathcal{A} \subseteq \mathcal{S}$, denote $\mathcal{A}^{\mathrm{c}} = \mathcal{S} \setminus \mathcal{A}$ as the complement of $\mathcal{A}$ and $|\mathcal{A}|$ as its cardinality. We define the support set of a vector $\boldsymbol{t}$ as $\operatorname{supp}(\boldsymbol{t}) = \{j : t_j \neq 0\}$. For an index set $\mathcal{A} \subseteq \mathcal{S}$, $\boldsymbol{t}_{\mathcal{A}} = (t_j,\ j \in \mathcal{A}) \in \mathbb{R}^{|\mathcal{A}|}$. For a matrix $X \in \mathbb{R}^{n \times p}$, define $X_{\mathcal{A}} = (X_j,\ j \in \mathcal{A}) \in \mathbb{R}^{n \times |\mathcal{A}|}$, where $X_j$ denotes the $j$th column of $X$. For any vector $\boldsymbol{t}$ and any set $\mathcal{A}$, $\boldsymbol{t}^{\mathcal{A}} \in \mathbb{R}^p$ is defined to be the vector whose $j$th entry is equal to $t_j$ if $j \in \mathcal{A}$ and zero otherwise. For instance, if $\mathcal{A} = \{1\}$, then $\boldsymbol{t}^{\mathcal{A}} = (t_1, 0, \ldots, 0)^\top$ and $\boldsymbol{t}^{\mathcal{A}^{\mathrm{c}}} = (0, t_2, \ldots, t_p)^\top$.
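As a concrete illustration of this indexing notation, here is a minimal NumPy sketch (the array names are ours, not from the paper; indices are 0-based in code):

```python
import numpy as np

t = np.array([1.5, 0.0, -2.0, 0.7])   # a coefficient vector t in R^p
A = np.array([0, 2])                  # an index set A (0-based in code)

t_A = t[A]                            # t_A: entries of t restricted to A, in R^{|A|}
t_up_A = np.zeros_like(t)
t_up_A[A] = t[A]                      # t^A: p-dimensional, keeps entries on A, zero elsewhere

support = np.flatnonzero(t)           # supp(t) = {j : t_j != 0}
print(t_A, t_up_A, support)
```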
Method
Splicing.
In this section, we describe the splicing method. Consider the $\ell_0$-constrained minimization problem
$$\min_{\boldsymbol{\beta}} \mathcal{L}_n(\boldsymbol{\beta}), \quad \text{subject to } \|\boldsymbol{\beta}\|_0 \leq s,$$
where $\mathcal{L}_n(\boldsymbol{\beta}) = \frac{1}{2n}\|\boldsymbol{y} - X\boldsymbol{\beta}\|_2^2$. Without loss of generality, we consider $\|\boldsymbol{\beta}\|_0 = s$. Given any initial set $\mathcal{A} \subseteq \mathcal{S} = \{1, 2, \ldots, p\}$ with cardinality $|\mathcal{A}| = s$, denote $\mathcal{I} = \mathcal{A}^{\mathrm{c}}$ and compute
$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}_{\mathcal{I}} = 0} \mathcal{L}_n(\boldsymbol{\beta}).$$
We call $\mathcal{A}$ and $\mathcal{I}$ the active set and the inactive set, respectively.
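Computing $\hat{\boldsymbol{\beta}}$ for a given active set is an ordinary least-squares fit on the selected columns. A minimal NumPy sketch (the function names are ours, not from the paper):

```python
import numpy as np

def loss(X, y, beta):
    """Empirical risk L_n(beta) = ||y - X beta||_2^2 / (2n)."""
    n = X.shape[0]
    r = y - X @ beta
    return r @ r / (2 * n)

def fit_on_active_set(X, y, A):
    """Least squares with beta restricted to zero outside the active set A."""
    p = X.shape[1]
    beta = np.zeros(p)
    coef, *_ = np.linalg.lstsq(X[:, A], y, rcond=None)
    beta[A] = coef
    return beta
```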
Given the active set $\mathcal{A}$ and $\hat{\boldsymbol{\beta}}$, we can define the following two types of sacrifices:

1) Backward sacrifice: For any $j \in \mathcal{A}$, the magnitude of discarding variable $j$ is
$$\xi_j = \mathcal{L}_n\big(\hat{\boldsymbol{\beta}}^{\mathcal{A}\setminus\{j\}}\big) - \mathcal{L}_n\big(\hat{\boldsymbol{\beta}}^{\mathcal{A}}\big) = \frac{X_j^\top X_j}{2n}\,\hat{\beta}_j^{\,2}. \tag{3}$$

2) Forward sacrifice: For any $j \in \mathcal{I}$, the magnitude of adding variable $j$ is
$$\zeta_j = \mathcal{L}_n\big(\hat{\boldsymbol{\beta}}^{\mathcal{A}}\big) - \mathcal{L}_n\big(\hat{\boldsymbol{\beta}}^{\mathcal{A}} + \hat{t}^{\{j\}}\big) = \frac{X_j^\top X_j}{2n}\left(\frac{\hat{d}_j}{X_j^\top X_j / n}\right)^2, \tag{4}$$
where $\hat{t} = \arg\min_{t} \mathcal{L}_n\big(\hat{\boldsymbol{\beta}}^{\mathcal{A}} + t^{\{j\}}\big)$ and $\hat{d}_j = X_j^\top (\boldsymbol{y} - X\hat{\boldsymbol{\beta}})/n$.
Intuitively, for $j \in \mathcal{A}$ (or $j \in \mathcal{I}$), a large $\xi_j$ (or $\zeta_j$) implies that the $j$th variable is potentially important. Unfortunately, the two sacrifices are incomparable because they are computed on support sets of different sizes. However, if we exchange some "irrelevant" variables in $\mathcal{A}$ with some "important" variables in $\mathcal{I}$, we may obtain a higher-quality solution. This intuition motivates our splicing method.
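Eqs. 3 and 4 can be computed in a vectorized way. A minimal NumPy sketch, assuming `X`, `y`, and `beta` are arrays as above and `A`, `I` are integer index arrays (the function name is ours):

```python
import numpy as np

def sacrifices(X, y, beta, A, I):
    """Backward sacrifices xi_j (j in A, Eq. 3) and forward sacrifices zeta_j (j in I, Eq. 4)."""
    n = X.shape[0]
    d = X.T @ (y - X @ beta) / n                         # d_j = X_j'(y - X beta) / n
    col_norm = np.sum(X ** 2, axis=0) / n                # X_j'X_j / n
    xi = col_norm[A] / 2 * beta[A] ** 2                  # cost of discarding an active variable
    zeta = col_norm[I] / 2 * (d[I] / col_norm[I]) ** 2   # gain of adding an inactive variable
    return xi, zeta
```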
Specifically, given any splicing size $k \leq s$, define
$$\mathcal{A}_k = \Big\{ j \in \mathcal{A} : \sum_{i \in \mathcal{A}} I(\xi_j \geq \xi_i) \leq k \Big\}$$
to represent the $k$ least relevant variables in $\mathcal{A}$ and
$$\mathcal{I}_k = \Big\{ j \in \mathcal{I} : \sum_{i \in \mathcal{I}} I(\zeta_j \leq \zeta_i) \leq k \Big\}$$
to represent the $k$ most relevant variables in $\mathcal{I}$. Then, we splice $\mathcal{A}$ and $\mathcal{I}$ by exchanging $\mathcal{A}_k$ and $\mathcal{I}_k$ and obtain a new active set
$$\tilde{\mathcal{A}} = (\mathcal{A} \setminus \mathcal{A}_k) \cup \mathcal{I}_k.$$
Let $\tilde{\mathcal{I}} = \tilde{\mathcal{A}}^{\mathrm{c}}$, $\tilde{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}_{\tilde{\mathcal{I}}} = 0} \mathcal{L}_n(\boldsymbol{\beta})$, and let $\tau_s > 0$ be a threshold. If $\mathcal{L}_n(\hat{\boldsymbol{\beta}}) - \mathcal{L}_n(\tilde{\boldsymbol{\beta}}) > \tau_s$, then $\tilde{\mathcal{A}}$ is preferable to $\mathcal{A}$. The active set can be updated iteratively until the loss function cannot be further improved by splicing. Once the algorithm recovers the true active set, splicing in some irrelevant variables may still decrease the loss function slightly; the threshold $\tau_s$ prevents this unnecessary calculation. Typically, $\tau_s$ is relatively small.
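A single splicing exchange can be sketched as follows, reusing the hypothetical helpers `fit_on_active_set`, `sacrifices`, and `loss` from the earlier sketches; `A` and `I` are integer index arrays, and the new active set is accepted only when the loss improves by more than the threshold:

```python
import numpy as np

def splice_once(X, y, A, I, k, tau):
    """One splicing exchange of size k between the active set A and the inactive set I."""
    beta = fit_on_active_set(X, y, A)
    xi, zeta = sacrifices(X, y, beta, A, I)
    A_k = A[np.argsort(xi)[:k]]                  # k least relevant active variables
    I_k = I[np.argsort(-zeta)[:k]]               # k most relevant inactive variables
    A_new = np.union1d(np.setdiff1d(A, A_k), I_k)
    I_new = np.setdiff1d(np.arange(X.shape[1]), A_new)
    beta_new = fit_on_active_set(X, y, A_new)
    if loss(X, y, beta) - loss(X, y, beta_new) > tau:
        return A_new, I_new, True                # accept the spliced active set
    return A, I, False                           # keep the current active set
```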
The remaining problem is to determine the initial set. Typically, we select the $s$ variables that are most correlated with the response $\boldsymbol{y}$ as the initial active set $\mathcal{A}^0$. Let $k_{\max}$ be the maximum splicing size, $k_{\max} \leq s$. We summarize the above procedure in Algorithm 1:
Algorithm 1: BESS.Fix($s$): Best-Subset Selection with a given support size $s$.
1) Input: $X$, $\boldsymbol{y}$, a positive integer $k_{\max}$, and a threshold $\tau_s$.
2) Initialize the active set $\mathcal{A}^0$ with the $s$ variables most correlated with $\boldsymbol{y}$, set $\mathcal{I}^0 = (\mathcal{A}^0)^{\mathrm{c}}$, and compute $(\boldsymbol{\beta}^0, \boldsymbol{d}^0)$:
$\boldsymbol{\beta}^0_{\mathcal{I}^0} = 0$, $\boldsymbol{d}^0_{\mathcal{A}^0} = 0$,
$\boldsymbol{\beta}^0_{\mathcal{A}^0} = \big(X_{\mathcal{A}^0}^\top X_{\mathcal{A}^0}\big)^{-1} X_{\mathcal{A}^0}^\top \boldsymbol{y}$,
$\boldsymbol{d}^0_{\mathcal{I}^0} = X_{\mathcal{I}^0}^\top \big(\boldsymbol{y} - X\boldsymbol{\beta}^0\big)/n$.
3) For $m = 0, 1, \ldots$, do
$(\boldsymbol{\beta}^{m+1}, \boldsymbol{d}^{m+1}, \mathcal{A}^{m+1}, \mathcal{I}^{m+1}) = \text{Splicing}(\boldsymbol{\beta}^m, \boldsymbol{d}^m, \mathcal{A}^m, \mathcal{I}^m, k_{\max}, \tau_s)$.
If $(\mathcal{A}^{m+1}, \mathcal{I}^{m+1}) = (\mathcal{A}^m, \mathcal{I}^m)$, then stop.
End for
4) Output $(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{d}}, \hat{\mathcal{A}}, \hat{\mathcal{I}}) = (\boldsymbol{\beta}^{m+1}, \boldsymbol{d}^{m+1}, \mathcal{A}^{m+1}, \mathcal{I}^{m+1})$.
Note that the splicing size $k$ is an important parameter in splicing. Typically, we can try all possible values of $k \leq s$.
Algorithm 2: Splicing($\boldsymbol{\beta}$, $\boldsymbol{d}$, $\mathcal{A}$, $\mathcal{I}$, $k_{\max}$, $\tau_s$).
1) Input: $\boldsymbol{\beta}$, $\boldsymbol{d}$, $\mathcal{A}$, $\mathcal{I}$, $k_{\max}$, and $\tau_s$.
2) Initialize $(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{d}}, \hat{\mathcal{A}}, \hat{\mathcal{I}}) = (\boldsymbol{\beta}, \boldsymbol{d}, \mathcal{A}, \mathcal{I})$, $L_0 = L = \mathcal{L}_n(\boldsymbol{\beta})$, and set
$\xi_j$ and $\zeta_j$, $j = 1, \ldots, p$, as in Eqs. 3 and 4.
3) For $k = 1, 2, \ldots, k_{\max}$, do
$\mathcal{A}_k = \big\{ j \in \mathcal{A} : \sum_{i \in \mathcal{A}} I(\xi_j \geq \xi_i) \leq k \big\}$, $\mathcal{I}_k = \big\{ j \in \mathcal{I} : \sum_{i \in \mathcal{I}} I(\zeta_j \leq \zeta_i) \leq k \big\}$,
Let $\tilde{\mathcal{A}}_k = (\mathcal{A} \setminus \mathcal{A}_k) \cup \mathcal{I}_k$, $\tilde{\mathcal{I}}_k = (\mathcal{I} \setminus \mathcal{I}_k) \cup \mathcal{A}_k$, and solve
$\tilde{\boldsymbol{\beta}}_{\tilde{\mathcal{A}}_k} = \big(X_{\tilde{\mathcal{A}}_k}^\top X_{\tilde{\mathcal{A}}_k}\big)^{-1} X_{\tilde{\mathcal{A}}_k}^\top \boldsymbol{y}$, $\tilde{\boldsymbol{\beta}}_{\tilde{\mathcal{I}}_k} = 0$, $\tilde{\boldsymbol{d}} = X^\top(\boldsymbol{y} - X\tilde{\boldsymbol{\beta}})/n$, $\tilde{L} = \mathcal{L}_n(\tilde{\boldsymbol{\beta}})$.
If $L > \tilde{L}$, then
$(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{d}}, \hat{\mathcal{A}}, \hat{\mathcal{I}}, L) = (\tilde{\boldsymbol{\beta}}, \tilde{\boldsymbol{d}}, \tilde{\mathcal{A}}_k, \tilde{\mathcal{I}}_k, \tilde{L})$,
End for
4) If $L_0 - L < \tau_s$, then $(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{d}}, \hat{\mathcal{A}}, \hat{\mathcal{I}}) = (\boldsymbol{\beta}, \boldsymbol{d}, \mathcal{A}, \mathcal{I})$.
5) Output $(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{d}}, \hat{\mathcal{A}}, \hat{\mathcal{I}})$.
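Putting the pieces together, the BESS.Fix loop can be sketched as follows. This is a simplified variant, not the algorithm verbatim: it accepts any exchange that improves the loss by more than the threshold rather than keeping only the best exchange over $k$ in a full splicing pass, and all function names are ours:

```python
import numpy as np

def bess_fix(X, y, s, k_max=None, tau=0.0, max_iter=50):
    """Best-subset selection with a fixed support size s via repeated splicing."""
    n, p = X.shape
    k_max = s if k_max is None else min(k_max, s)
    # Initial active set: the s columns most correlated with the response y.
    corr = np.abs(X.T @ y) / np.sqrt(np.sum(X ** 2, axis=0))
    A = np.sort(np.argsort(-corr)[:s])
    I = np.setdiff1d(np.arange(p), A)
    for _ in range(max_iter):
        changed = False
        for k in range(1, k_max + 1):
            A, I, accepted = splice_once(X, y, A, I, k, tau)
            changed = changed or accepted
        if not changed:                          # active set is stable: stop
            break
    return A, fit_on_active_set(X, y, A)
```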
ABESS.
In practice, the support size $s$ is usually unknown. We use a data-driven procedure to determine $s$. Information criteria such as the high-dimensional BIC (HBIC) (13) and the extended BIC (EBIC) (14) are commonly used for this purpose. Specifically, HBIC (13) can be applied to select the tuning parameter in penalized likelihood estimation. To recover the support size $s$ for the best-subset selection, we introduce a criterion that is a special case of HBIC (13). While HBIC aims to tune the parameter for a nonconvex penalized regression, our proposal is used to determine the size of the best subset. For any active set $\mathcal{A}$, define an SIC as follows:
$$\mathrm{SIC}(\mathcal{A}) = n \log \mathcal{L}_{\mathcal{A}} + |\mathcal{A}| \log(p) \log\log n,$$
where $\mathcal{L}_{\mathcal{A}} = \min_{\boldsymbol{\beta}_{\mathcal{I}} = 0} \mathcal{L}_n(\boldsymbol{\beta})$ and $\mathcal{I} = \mathcal{A}^{\mathrm{c}}$. To identify the true model, the model complexity penalty is $\log p$, and the slow diverging rate $\log\log n$ is set to prevent underfitting. Theorem 4 states that the following ABESS algorithm selects the true support size via SIC.
Let $s_{\max}$ be the maximum support size. Theorem 4 suggests $s_{\max} = O\big(\frac{n}{\log p \log\log n}\big)$ as the maximum possible recovery size. Typically, we set $s_{\max} = \big[\frac{n}{\log p \log\log n}\big]$, where $[x]$ denotes the integer part of $x$.
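Under the definition above, SIC is easy to compute for any candidate active set. A minimal sketch, reusing the hypothetical helpers `fit_on_active_set` and `loss` from earlier:

```python
import numpy as np

def sic(X, y, A):
    """Special information criterion for an active set A of column indices."""
    n, p = X.shape
    beta = fit_on_active_set(X, y, A)
    L_A = loss(X, y, beta)                       # L_A = minimal loss over beta supported on A
    return n * np.log(L_A) + len(A) * np.log(p) * np.log(np.log(n))
```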
Algorithm 3: ABESS.
1) Input: $X$, $\boldsymbol{y}$, and the maximum support size $s_{\max}$.
2) For $s = 1, 2, \ldots, s_{\max}$, do
$(\hat{\boldsymbol{\beta}}_s, \hat{\boldsymbol{d}}_s, \hat{\mathcal{A}}_s, \hat{\mathcal{I}}_s) = \text{BESS.Fix}(s)$.
End for
3) Compute the minimum of SIC: $s_{\min} = \arg\min_{s} \mathrm{SIC}(\hat{\mathcal{A}}_s)$.
4) Output $(\hat{\boldsymbol{\beta}}_{s_{\min}}, \hat{\boldsymbol{d}}_{s_{\min}}, \hat{\mathcal{A}}_{s_{\min}}, \hat{\mathcal{I}}_{s_{\min}})$.
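The outer loop is then a straightforward search over support sizes; a minimal sketch using the hypothetical helpers `bess_fix` and `sic` defined earlier:

```python
def abess(X, y, s_max):
    """Adaptive best-subset selection: run BESS.Fix over support sizes, pick the SIC minimizer."""
    best = None
    for s in range(1, s_max + 1):
        A, beta = bess_fix(X, y, s)
        score = sic(X, y, A)
        if best is None or score < best[0]:
            best = (score, A, beta)
    return best[1], best[2]                      # selected active set and coefficient estimate
```

Following the rule above, one might call it with, e.g., `s_max = int(n / (np.log(p) * np.log(np.log(n))))`.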
Theoretical Results
We establish the computational complexity and the consistency of the best subset recovery from the ABESS algorithm.
Conditions.
Let $\boldsymbol{\beta}^*$ be the true regression coefficient vector with sparsity level $s^* = \|\boldsymbol{\beta}^*\|_0$ in Model 1. Denote the true active set by $\mathcal{A}^* = \operatorname{supp}(\boldsymbol{\beta}^*)$ and the minimal signal strength by $\vartheta = \min_{j \in \mathcal{A}^*} |\beta_j^*|$. Without loss of generality, assume the design matrix $X$ has $\sqrt{n}$-normalized columns, i.e., $\|X_j\|_2^2 = n$ for $j = 1, \ldots, p$. We say that $X$ satisfies the sparse restricted condition (SRC) (15) with order $s$ and spectrum bounds $\{c_-(s), c_+(s)\}$ if, for any index set $\mathcal{A}$ with $|\mathcal{A}| \leq s$ and any nonzero vector $\boldsymbol{u} \in \mathbb{R}^{|\mathcal{A}|}$,
$$c_-(s) \leq \frac{\|X_{\mathcal{A}} \boldsymbol{u}\|_2^2}{n \|\boldsymbol{u}\|_2^2} \leq c_+(s).$$
We denote this condition by $X \in \mathrm{SRC}\{s, c_-(s), c_+(s)\}$. The SRC gives the range of the spectrum of the diagonal submatrices of the Gram matrix $X^\top X / n$. The spectrum of the off-diagonal submatrices of $X^\top X / n$ can be bounded by the sparse orthogonality constant $\theta_{a,b}$, defined as the smallest number such that
$$\frac{\|X_{\mathcal{A}}^\top X_{\mathcal{B}} \boldsymbol{u}\|_2}{n \|\boldsymbol{u}\|_2} \leq \theta_{a,b}$$
for any disjoint index sets $\mathcal{A}$ and $\mathcal{B}$ with $|\mathcal{A}| \leq a$ and $|\mathcal{B}| \leq b$ and any nonzero vector $\boldsymbol{u}$. A derived quantity, defined in Eq. 5 in terms of these spectrum bounds, appears in condition 3 below.
To prove the theoretical properties of the estimator, we assume the following conditions:
1) The random errors $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. with mean zero and sub-Gaussian tails; that is, there exists a $\sigma > 0$ such that $\mathbb{P}(|\varepsilon_i| \geq t) \leq 2\exp\big(-t^2/(2\sigma^2)\big)$ for all $t \geq 0$.
2) The design matrix $X$ satisfies the SRC defined above; i.e., its sparse eigenvalues are bounded away from zero and infinity (see Remark 1).
3) The quantity defined in Eq. 5 is bounded above by a suitable constant (see Remark 2).
4) The threshold $\tau_s$ is large enough to control the random errors (see Remark 3).
5) An additional technical condition; see Remark 4.
6) The minimal signal strength $\vartheta$ is sufficiently large; in particular, the signal is stronger than the threshold (see Remark 3).
7) The true sparsity level $s^*$ and the maximum support size $s_{\max}$ do not grow too fast relative to $n$ and $p$ (see Remark 4).
Remark 1:
The sub-Gaussian condition is often assumed in the related literature and is slightly weaker than the standard normality assumption. Condition 2 imposes bounds on the sparse eigenvalues of the design matrix. As a typical condition in modeling high-dimensional data, it restricts the correlation among any small number of variables and thus guarantees the identifiability of the true active set. For example, the SRC has been assumed in existing methods (15–17). Sufficient conditions for a design matrix to satisfy the SRC are provided in propositions 4.1 and 4.2 of ref. 15.
Remark 2:
To verify condition 3, one can relate the spectrum quantities above to the restricted isometry property (RIP) (18) constant of the normalized design matrix. By lemma 20 in ref. 19, a sufficient condition for condition 3 is a bound on this RIP-type constant, and this bound is weaker than the corresponding condition in ref. 19.
Remark 3:
Condition 4 ensures that the threshold $\tau_s$ can control the random errors. Condition 6 specifies the minimal magnitude of the signal required for best-subset recovery: to discriminate the signal from the threshold, the signal needs to be stronger than the threshold. This condition is slightly stronger than the corresponding condition in ref. 20.
Remark 4:
For the recovery of the true active set, the true sparsity level $s^*$ and the maximum model size $s_{\max}$ cannot be too large. Condition 7 is weaker than the corresponding condition in ref. 13 because we consider the least-squares loss function without a concave penalty. As shown in SI Appendix, condition 5 can be removed.
Computational Theory.
First, we show that the splicing method converges in a finite number of steps.
Theorem 1.
Algorithm 1 terminates in a finite number of iterations.
This follows immediately from the fact that the loss function strictly decreases whenever the active set is updated and there are only finitely many active sets of size $s$. Furthermore, the next theorem delineates the polynomial complexity of the ABESS algorithm.
Theorem 2.
Suppose conditions 1 and 4 hold, and assume conditions 2, 3, and 6 hold. Then the computational complexity of ABESS for a given maximum support size $s_{\max}$ is polynomial in $n$ and $p$.
If $s \geq s^*$, Algorithm 1 will find the true active set with high probability under conditions 1–4 (Lemma 1). Furthermore, by splicing, the loss function decreases drastically in the first several iterations, and the convergence rate of Algorithm 1 is presented in Theorem 3. However, if $s < s^*$, we can still bound the number of iterations of Algorithm 1 by using the threshold to exclude useless splicing. Thus, we can show that the number of iterations of Algorithm 1 is polynomial.
Statistical Theory.
In what follows, "with high probability" means with probability tending to one, where the implicit constants depend on the quantities in conditions 1–3. The following lemma gives an interesting property of the active set output by Algorithm 1.
Lemma 1.
Suppose $\hat{\mathcal{A}}$ is the solution of Algorithm 1 for a given support size $s$ and conditions 1–4 hold. Then, with high probability, the estimated active set contains the true active set whenever $s \geq s^*$ and is contained in it whenever $s < s^*$. Furthermore, if conditions 5 and 6 hold, the containment $\mathcal{A}^* \subseteq \hat{\mathcal{A}}$ holds with probability tending to one. In particular, if $s = s^*$, we have $\hat{\mathcal{A}} = \mathcal{A}^*$ with high probability.
Lemma 1 indicates that our estimator of the active set will eventually include the true active set. The next theorem characterizes the number of iterations and the error bound of the splicing method.
Theorem 3.
Suppose $\boldsymbol{\beta}^l$ is the $l$th iterate of Algorithm 1 for a given support size $s$, and suppose conditions 1–4 hold. Then, with high probability, we have:

1) the number of splicing iterations required before Algorithm 1 stabilizes is at most logarithmic in a quantity involving the minimal signal strength and the constant defined in condition 3; and

2) the estimation error $\|\boldsymbol{\beta}^l - \boldsymbol{\beta}^*\|_2$ decays geometrically in $l$, up to a statistical error term.
With the threshold $\tau_s$, Theorem 3 suggests that our splicing method terminates within a logarithmic number of iterations. The estimation error decays geometrically.
The next theorem guarantees that the splicing method can recover the true active set with a high probability.
Theorem 4 (Consistency of Best-Subset Recovery).
Suppose conditions 1, 4, and 7 hold, and assume conditions 2, 3, and 6 hold. Then, under the information criterion SIC, with probability at least $1 - \mathcal{O}(p^{-\gamma})$ for some positive constant $\gamma$ and a sufficiently large $n$, the ABESS algorithm selects the true active set; that is, $\hat{\mathcal{A}} = \mathcal{A}^*$.
Theorem 4 implies that the solution of the splicing method is the same as the oracle least-squares estimator even when the sparsity level is unknown. Since our approach can recover the true active set, we can directly deduce the asymptotic distribution of $\hat{\boldsymbol{\beta}}$.
Corollary 1 (Asymptotic Properties).
Suppose the assumptions and conditions in Theorem 4 hold. Then, with high probability, the solution of ABESS is the oracle estimator, i.e.,
$$\hat{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}^{\mathrm{o}},$$
where $\hat{\boldsymbol{\beta}}^{\mathrm{o}}_{(\mathcal{A}^*)^{\mathrm{c}}} = 0$ and $\hat{\boldsymbol{\beta}}^{\mathrm{o}}_{\mathcal{A}^*} = \big(X_{\mathcal{A}^*}^\top X_{\mathcal{A}^*}\big)^{-1} X_{\mathcal{A}^*}^\top \boldsymbol{y}$ is the least-squares estimator given the true active set $\mathcal{A}^*$. Furthermore,
$$\sqrt{n}\,\big(\hat{\boldsymbol{\beta}}_{\mathcal{A}^*} - \boldsymbol{\beta}^*_{\mathcal{A}^*}\big) \xrightarrow{d} N\big(0, \sigma^2 \Sigma_{\mathcal{A}^*}^{-1}\big),$$
where $\Sigma_{\mathcal{A}^*} = \lim_{n \to \infty} X_{\mathcal{A}^*}^\top X_{\mathcal{A}^*}/n$.
Simulation
In this part, we compare the proposed ABESS algorithm with other variable-selection algorithms under Model 1, where the rows of the design matrix $X$ are i.i.d. sampled from a multivariate normal distribution with mean $\boldsymbol{0}$ and covariance matrix $\Sigma$. The error terms are i.i.d. drawn from the normal distribution $N(0, \sigma^2)$.
We consider four criteria to assess the methods. The first two, the true-positive rate (TPR) and the true-negative rate (TNR), evaluate the performance of variable selection. The estimation accuracy for $\boldsymbol{\beta}^*$ is measured by the relative error (ReErr) of $\hat{\boldsymbol{\beta}}$ with respect to $\boldsymbol{\beta}^*$. We also examine the discrepancy between the estimated sparsity level and the ground truth, measured by the sparsity-level error (SLE), i.e., the difference between the estimated and the true support size. All simulation results are based on 100 synthetic datasets.
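For completeness, a small helper that computes these criteria is sketched below. The exact formulas used here for ReErr (relative $\ell_2$ error) and SLE (estimated minus true support size) are our assumptions, since the paper's displayed definitions are not reproduced above:

```python
import numpy as np

def selection_metrics(beta_hat, beta_true):
    """TPR/TNR for support recovery, plus (assumed) relative error and sparsity-level error."""
    est, true = beta_hat != 0, beta_true != 0
    tpr = np.sum(est & true) / max(np.sum(true), 1)       # true-positive rate
    tnr = np.sum(~est & ~true) / max(np.sum(~true), 1)    # true-negative rate
    re_err = np.linalg.norm(beta_hat - beta_true) / np.linalg.norm(beta_true)  # assumed ReErr
    sle = int(np.sum(est)) - int(np.sum(true))            # assumed SLE: estimated minus true size
    return tpr, tnr, re_err, sle
```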
Low-Dimensional Case.
We begin with a low-dimensional setting and compare ABESS and all-subsets regression (ASR), which exhaustively searches for the best subsets of the explanatory variables to predict the response via an efficient branch-and-bound algorithm (21). We use SIC (ASR-SIC) to select a model size for ASR.
We adopt a simulation model from ref. 6. Specifically, the coefficient vector $\boldsymbol{\beta}^*$ is fixed, and the covariance matrix $\Sigma$ has a decayed structure in which the correlation between two variables decreases with the distance between their indices. The pair of sample size and noise level $(n, \sigma)$ varies across three settings. It can be seen from Table 1 that when the noise level is large but the sample size is small, the performance of ABESS and ASR is close, although ASR is slightly better. When the noise level decreases, the slight advantage of ASR-SIC disappears. The fact that ABESS performs as well as the exhaustive ASR algorithm, when the setting is simple enough for the latter to be computationally feasible, demonstrates the power of ABESS in selecting the best subset.
Table 1. Comparison of ABESS and ASR-SIC across the three $(n, \sigma)$ settings (SDs in parentheses)
Method | TPR | TNR | ReErr | SLE |
ABESS | 0.90 (0.17) | 0.86 (0.15) | 0.20 (0.19) | 0.40 (0.89) |
ASR-SIC | 0.91 (0.17) | 0.87 (0.15) | 0.14 (0.13) | 0.38 (0.85) |
ABESS | 1.00 (0.00) | 0.87 (0.14) | 0.02 (0.02) | 0.63 (0.72) |
ASR-SIC | 1.00 (0.00) | 0.87 (0.14) | 0.02 (0.02) | 0.63 (0.72) |
ABESS | 1.00 (0.00) | 0.90 (0.13) | 0.01 (0.01) | 0.48 (0.64) |
ASR-SIC | 1.00 (0.00) | 0.90 (0.13) | 0.01 (0.01) | 0.49 (0.64) |
Next, we study the computational time and computational complexity of the ASR and ABESS algorithms by appending zero coefficients to $\boldsymbol{\beta}^*$ from the previous experiment to form a new coefficient vector with $p$ components in total. Without loss of generality, we consider the runtime of the algorithms as $p$ increases from 20 to 40 with step size 1. Fig. 1 presents the simulation results. On the one hand, from Fig. 1A, we can see that the differences between ASR-SIC and ABESS in the three criteria are all very small, and, hence, we can conclude that ABESS and ASR have a negligible difference in this setting. On the other hand, from Fig. 1 B and C, the computational time of ASR is about 20 s when the dimensionality reaches 40, while that of ABESS is less than 0.03 s. More importantly, from Fig. 1B, the computational time of ABESS grows linearly as the dimension increases, as proven in Theorem 2. In contrast, from Fig. 1C, the runtime of ASR increases exponentially. In summary, ABESS not only recovers the support but is also computationally fast.
High-Dimensional Case.
We consider the case when the dimension $p$ is in the hundreds or even thousands, for which an exhaustive search is computationally infeasible. It is of interest to compare ABESS with modern variable-selection algorithms, including LASSO (4), SCAD (6), and MCP (7). The solutions of these three algorithms are given by the coordinate descent algorithm (22, 23) implemented in the R packages glmnet and ncvreg. For all of these methods, we use SIC to select the optimal regularization parameters. We also consider cross-validation (CV), a widely used method, to select the tuning parameter. For MCP/SCAD/LASSO, the candidate regularization parameters are prespecified values following the default settings of the R packages glmnet and ncvreg. For a fair comparison, the corresponding input argument of the ABESS algorithm, the maximum support size $s_{\max}$, is also set to a prespecified value. Note that the concavity parameters of the SCAD and MCP penalties are fixed at 3.7 and 3, respectively (6, 7).
The dimension, $p$, of the explanatory variables increases over 500, 1,500, and 2,500, but only 10 randomly selected variables affect the response. Among the 10 effective variables, 3 have a strong effect, 4 have a moderate effect, and the rest have a weak effect. Here, a strong/moderate/weak effect means that a coefficient is sampled from a zero-mean normal distribution with SD 10, 5, or 2, respectively. We consider two structures of $\Sigma$. The first is the uncorrelated structure $\Sigma = I$, and the second is a constant-correlation structure, corresponding to the case in which any two explanatory variables are highly correlated. The sample size is fixed at 500, and the noise level is fixed at 1.
The simulation results are presented in Fig. 2 A and B. A few observations are noteworthy. First, among all of the methods, ABESS and the CV-based LASSO estimator have the best performance in correctly identifying the true effective variables; moreover, ABESS keeps the false-positive rate at a low level, as SCAD and MCP do. Second, SIC helps ABESS detect the true model size efficiently, and its SLE approaches 0. In conjunction with the first point, these empirical results corroborate the performance of ABESS proven in Theorem 4. In contrast, MCP and SCAD underestimate the model size, whereas LASSO overestimates it. Also, like BIC (24), SIC avoids overfitting (see additional simulation studies in SI Appendix). Finally, the parameter estimation of ABESS is superior to that of the other algorithms because ABESS not only effectively recovers the support set but also yields an unbiased parameter estimate. Fig. 2C compares the runtimes: ABESS is computationally efficient and, as expected, much faster than the CV-based LASSO/SCAD/MCP methods.
Summary
We present an iterative splicing method that distinguishes the active set from the inactive set in variable selection. The estimated active set is shown to contain the true active set when the given support size is no less than the true size, or to be included in the true active set when the given support size is less than the true size. We also introduce the special information criterion (SIC) to adaptively determine the sparsity level, which guarantees selection of the true active set with high probability. We show that our solution is globally optimal for the Lagrangian of Eq. 2 with SIC and possesses the oracle properties with high probability. Numerical results demonstrate the theoretical properties of ABESS. However, when there are a large number of weak effects, the ambiguity makes it challenging to detect the signals. ABESS, as well as other methods such as LASSO, SCAD, and MCP, faces a similar difficulty. How to perform effective subset selection in the presence of many weak effects warrants further research.
Supplementary Material
Acknowledgments
X.W.’s research is partially supported by National Key Research and Development Program of China Grant 2018YFC1315400, Natural Science Foundation of China (NSFC) Grants 71991474 and 11771462, and Key Research and Development Program of Guangdong, China Grant 2019B020228001. H.Z.’s research is supported in part by US NIH Grants R01HG010171 and R01MH116527 and NSF Grant DMS1722544. C.W.’s research is partially supported by NSFC Grant 11801540, Natural Science Foundation of Anhui Grant BJ2040170017, and Fundamental Research Funds for the Central Universities Grant WK2040000016.
Footnotes
The authors declare no competing interest.
This article is a PNAS Direct Submission.
R.L. is a guest editor invited by the Editorial Board.
This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.2014241117/-/DCSupplemental.
Data Availability.
All study data are included in the article and SI Appendix.
References
1. Akaike H., "Information theory and an extension of the maximum likelihood principle" in Selected Papers of Hirotugu Akaike, Parzen E., Tanabe K., Kitagawa G., Eds. (Springer, 1998), pp. 199–213.
2. Schwarz G., Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978).
3. Mallows C. L., Some comments on Cp. Technometrics 15, 661–675 (1973).
4. Tibshirani R., Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc. B 58, 267–288 (1996).
5. Zou H., The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101, 1418–1429 (2006).
6. Fan J., Li R., Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360 (2001).
7. Zhang C., Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38, 894–942 (2010).
8. Hazimeh H., Mazumder R., Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms. Oper. Res., in press.
9. Natarajan B. K., Sparse approximate solutions to linear systems. SIAM J. Comput. 24, 227–234 (1995).
10. Blumensath T., Davies M. E., Iterative hard thresholding for compressed sensing. Appl. Comput. Harmon. Anal. 27, 265–274 (2009).
11. Hintermüller M., Ito K., Kunisch K., The primal-dual active set strategy as a semismooth Newton method. SIAM J. Optim. 13, 865–888 (2002).
12. Bertsimas D., King A., Mazumder R., Best subset selection via a modern optimization lens. Ann. Stat. 44, 813–852 (2016).
13. Wang L., Kim Y., Li R., Calibrating non-convex penalized regression in ultra-high dimension. Ann. Stat. 41, 2505–2536 (2013).
14. Chen J., Chen Z., Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95, 759–771 (2008).
15. Zhang C., Huang J., The sparsity and bias of the lasso selection in high-dimensional linear regression. Ann. Stat. 36, 1567–1594 (2008).
16. Bickel P. J., Ritov Y., Tsybakov A. B., Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 37, 1705–1732 (2009).
17. Raskutti G., Wainwright M. J., Yu B., Restricted eigenvalue properties for correlated Gaussian designs. J. Mach. Learn. Res. 11, 2241–2259 (2010).
18. Candes E. J., Tao T., Decoding by linear programming. IEEE Trans. Inf. Theor. 51, 4203–4215 (2005).
19. Huang J., Jiao Y., Liu Y., Lu X., A constructive approach to penalized regression. J. Mach. Learn. Res. 19, 1–37 (2018).
20. Zheng Z., Bahadori M. T., Liu Y., Lv J., Scalable interpretable multi-response regression via seed. J. Mach. Learn. Res. 20, 1–34 (2019).
21. Miller A., Subset Selection in Regression (CRC Press, 2002).
22. Friedman J., Hastie T., Tibshirani R., Regularization paths for generalized linear models via coordinate descent. J. Stat. Software 33, 1–22 (2010).
23. Breheny P., Huang J., Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Stat. 5, 232–253 (2011).
24. Zhang Y., Li R., Tsai C. L., Regularization parameter selections via generalized information criterion. J. Am. Stat. Assoc. 105, 312–323 (2010).