Biometrika. 2010 Jun 30;97(3):551–566. doi: 10.1093/biomet/asq033

Penalized Bregman divergence for large-dimensional regression and classification

Chunming Zhang 1, Yuan Jiang 1, Yi Chai 1

Summary

Regularization methods are characterized by loss functions measuring data fits and penalty terms constraining model parameters. The commonly used quadratic loss is not suitable for classification with binary responses, whereas the loglikelihood function is not readily applicable to models where the exact distribution of observations is unknown or not fully specified. We introduce the penalized Bregman divergence by replacing the negative loglikelihood in the conventional penalized likelihood with Bregman divergence, which encompasses many commonly used loss functions in the regression analysis, classification procedures and machine learning literature. We investigate new statistical properties of the resulting class of estimators with the number pn of parameters either diverging with the sample size n or even nearly comparable with n, and develop statistical inference tools. It is shown that the resulting penalized estimator, combined with appropriate penalties, achieves the same oracle property as the penalized likelihood estimator, but asymptotically does not rely on the complete specification of the underlying distribution. Furthermore, the choice of loss function in the penalized classifiers has an asymptotically relatively negligible impact on classification performance. We illustrate the proposed method for quasilikelihood regression and binary classification with simulation evaluation and real-data application.

Some key words: Consistency, Divergence minimization, Exponential family, Loss function, Optimal Bayes rule, Oracle property, Quasilikelihood

1. Introduction

Regularization is used to obtain well-behaved solutions to overparameterized estimation problems, and is particularly appealing in high dimensions. The topic is reviewed by Bickel & Li (2006). Regularization estimates a vector parameter of interest β ∈ 𝕉pn by minimizing the criterion function,

ℓ_n(β) = L_n(β) + P_{λ_n}(β)  (λ_n > 0),

consisting of a data fit functional Ln, which measures how well β fits the observed set of data; a penalty functional Pλn, which assesses the physical plausibility of β; and a regularization parameter λn, which regulates the penalty. Depending on the nature of the output variable, the term Ln quantifies the error of an estimator by different error measures. For example, the quadratic loss function has nice analytical properties and is usually used in regression analysis. However, it is not always adequate in classification problems, where the misclassification loss, deviance loss, hinge loss for the support vector machine (Vapnik, 1996) and exponential loss for boosting (Hastie et al., 2001) are more realistic and commonly used in classification procedures.

Currently, most research on regularization methods is devoted to variants of penalty methods in conjunction with linear models and likelihood-based models in regression analysis. For linear model estimation with a fixed number p of parameters, Tibshirani (1996) introduced the L1-penalty for the proposed lasso method, where the quadratic loss is used. Theoretical properties of the lasso have been intensively studied; see Knight & Fu (2000), Meinshausen & Buhlmann (2006) and Zhao & Yu (2006). Zou (2006) showed that the lasso is in general not variable-selection consistent, whereas the adaptive lasso, which combines appropriately weighted L1-penalties, is consistent. Huang et al. (2008) extended the results of Zou (2006) to high-dimensional linear models. Using the smoothly clipped absolute deviation penalty, Fan & Li (2001) showed that the penalized likelihood estimator achieves the oracle property: the resulting estimator is asymptotically as efficient as the oracle estimator. In their treatment, the number of model parameters is fixed at p, and the loss function equals the negative loglikelihood. Fan & Peng (2004) extended the result to p_n diverging with n at a certain rate.

On the loss side, the literature on penalization methods includes much less discussion of either the role of the loss function in regularization for models other than linear or likelihood-based models, or the impact of different loss functions on classification performance. The least angle regression algorithm (Efron et al., 2004) for L1-penalization was developed for linear models using the quadratic loss. Rosset & Zhu (2007) studied the piecewise linear regularized solution paths for differentiable and piecewise quadratic loss functions with L1 penalty. It remains desirable to explore whether penalization methods using other types of loss functions can potentially benefit from the efficient least-angle regression algorithm. Moreover, theoretical results on the penalized likelihood are not readily translated into results for approaches, such as quasilikelihood (Wedderburn, 1974; McCullagh, 1983; Strimmer, 2003), where the distribution of the observations is unknown or not fully specified. Accordingly, a discussion of statistical inference for penalized estimation using a wider range of loss functions is needed.

In this study, we broaden the scope of penalization by incorporating loss functions belonging to the Bregman divergence class which unifies many commonly used loss functions. In particular, the quasilikelihood function and all loss functions mentioned previously in classification fall into this class. We introduce the penalized Bregman divergence by replacing the quadratic loss or the negative loglikelihood in penalized least-squares or penalized likelihood with Bregman divergence, and call the resulting estimator a penalized Bregman divergence estimator. Nonetheless, the Bregman divergence in general does not fulfill assumptions specifically imposed on the likelihood function associated with penalized likelihood.

We investigate new statistical properties of large-dimensional penalized Bregman divergence estimators, with dimensions dealt with separately in two cases:

Case I: p_n is diverging with n;  (1)
Case II: p_n is nearly comparable with n.  (2)

Zhang & Zhang (2010) give an application of the penalization method developed in this paper to estimating the hemodynamic response function for brain fMRI data where pn is as large as n. The current paper shows that the penalized Bregman divergence estimator, combined with appropriate penalties, achieves the same oracle property as the penalized likelihood estimator, but the asymptotic distribution does not rely on the complete specification of the underlying distribution. From the classification viewpoint, our study elucidates the applicability and consistency of various classifiers induced by penalized Bregman divergence estimators. Technical details of this paper are in the online Supplementary Material.

2. The penalized Bregman divergence estimator

2.1. Bregman divergence

We give a brief overview of Bregman divergence. For a given concave function q with derivative q′, Bregman (1967) introduced a device for constructing a bivariate function,

Q(ν, μ) = q(μ) − q(ν) + (ν − μ) q′(μ).  (3)

Figure 1 displays Q and the corresponding q. It is readily seen that the concavity of q ensures the nonnegativity of Q. Moreover, for a strictly concave q, Q(ν, μ) = 0 is equivalent to ν = μ. However, since Q(ν, μ) is not generally symmetric in ν and μ, Q is not a metric or distance in the strict sense. Hence, we call Q the Bregman divergence and call q the generating function of Q. See Efron (1986), Lafferty et al. (1997), Lafferty (1999), Kivinen & Warmuth (1999), Grünwald & Dawid (2004), Altun & Smola (2006) and references therein.
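
In more detail (a standard one-line argument, spelled out here for completeness): concavity of q gives the tangent-line bound q(ν) ⩽ q(μ) + (ν − μ)q′(μ) for all ν and μ, so that

Q(ν, μ) = q(μ) − q(ν) + (ν − μ)q′(μ) ⩾ 0,

with equality only at ν = μ when q is strictly concave.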

Fig. 1. Illustration of Q(ν, μ) as defined in (3). The concave curve is q; the two dashed lines indicate the locations of ν and μ; the solid straight line is q(μ) + (ν − μ)q′(μ); the length of the vertical line with arrows at each end is Q(ν, μ).

The Bregman divergence is suitable for a broad array of error measures Q. For example, q(μ) = aμ − μ² with some constant a yields the quadratic loss Q(Y, μ) = (Y − μ)². For a binary response variable Y, q(μ) = min{μ, (1 − μ)} gives the misclassification loss Q(Y, μ) = I{Y ≠ I(μ > 1/2)}, where I(·) denotes the indicator function; q(μ) = −{μ log(μ) + (1 − μ) log(1 − μ)} gives the Bernoulli deviance loss Q(Y, μ) = −{Y log(μ) + (1 − Y) log(1 − μ)}; q(μ) = 2 min{μ, (1 − μ)} results in the hinge loss; and q(μ) = 2{μ(1 − μ)}^{1/2} yields the exponential loss Q(Y, μ) = exp[−(Y − 0.5) log{μ/(1 − μ)}].
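
As a small numerical illustration (a minimal Python sketch of ours, not part of the paper's numerical work; the helper names are illustrative), the construction (3) can be coded once and specialized to two of the losses above:

    import math

    def bregman(q, dq):
        # Q(nu, mu) = q(mu) - q(nu) + (nu - mu) * q'(mu), as in (3)
        return lambda nu, mu: q(mu) - q(nu) + (nu - mu) * dq(mu)

    # quadratic loss: q(mu) = a*mu - mu^2; the affine part a*mu cancels in (3), so take a = 0
    Q_quad = bregman(lambda mu: -mu ** 2, lambda mu: -2.0 * mu)

    # Bernoulli deviance: q(mu) = -{mu log(mu) + (1 - mu) log(1 - mu)}
    def xlogx(t):
        return 0.0 if t == 0 else t * math.log(t)
    Q_dev = bregman(lambda mu: -(xlogx(mu) + xlogx(1.0 - mu)),
                    lambda mu: -math.log(mu / (1.0 - mu)))

    y, mu = 1.0, 0.8
    assert abs(Q_quad(y, mu) - (y - mu) ** 2) < 1e-12
    assert abs(Q_dev(y, mu) - (-(y * math.log(mu) + (1 - y) * math.log(1 - mu)))) < 1e-12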

Conversely, for a given Q, Zhang et al. (2009) provided necessary and sufficient conditions for Q being a Bregman divergence, and in that case derived an explicit formula for q. Applying this inverse approach from Q to q, they illustrated that the quasilikelihood function, the Kullback–Leibler divergence or the deviance loss for the exponential family of probability functions, and many margin-based loss functions (Shen et al., 2003) are Bregman divergences. To our knowledge, there is little theoretical work in the literature on thoroughly examining the penalized Bregman divergence, via methods of regularization, for large-dimensional model building, variable selection and classification problems.

2.2. The model and penalized Bregman divergence estimator

Let (X, Y) denote a random realization from some underlying population, where X = (X1, …, Xpn)T is the input vector and Y is the output variable. The dimension pn follows the assumption in (1) or (2). We assume the parametric model,

m(x) = E(Y | X = x) = F^{-1}(b_{0;0} + x^T β_0),  (4)

where F is a known link function, b_{0;0} ∈ 𝕉^1 and β_0 = (β_{1;0}, …, β_{p_n;0})^T ∈ 𝕉^{p_n} are the unknown true parameters. Throughout the paper, it is assumed that some entries in β_0 are exactly zero. Write β_0 = {β_0^{(I)T}, β_0^{(II)T}}^T, where β_0^{(I)} collects all nonzero coefficients and β_0^{(II)} = 0.

Our goal is to estimate the true parameters via penalization. Let {(X_1, Y_1), …, (X_n, Y_n)} be a sample of independent random pairs from (X, Y), where X_i = (X_{i1}, …, X_{ip_n})^T. The penalized Bregman divergence estimator (b̂_0, β̂) is defined as the minimizer of the criterion function,

ℓ_n(b_0, β) = (1/n) Σ_{i=1}^n Q{Y_i, F^{-1}(b_0 + X_i^T β)} + Σ_{j=1}^{p_n} P_{λ_n}(|β_j|),  (5)

where β = (β_1, …, β_{p_n})^T, the loss function Q(·, ·) is a Bregman divergence, and P_{λ_n}(·) represents a nonnegative penalty function indexed by a tuning constant λ_n > 0. Set β̃ = (b_0, β^T)^T and, correspondingly, X̃_i = (1, X_i^T)^T. Then (5) can be written as

ℓ_n(β̃) = (1/n) Σ_{i=1}^n Q{Y_i, F^{-1}(X̃_i^T β̃)} + Σ_{j=1}^{p_n} P_{λ_n}(|β_j|).  (6)

The penalized Bregman divergence estimator is β̃_E = (b̂_0, β̂_1, …, β̂_{p_n})^T = arg min_{β̃} ℓ_n(β̃).

Regarding the uniqueness of β̃E, assume that the quantities

q_j(y; θ) = (∂^j/∂θ^j) Q{y, F^{-1}(θ)}  (j = 0, 1, …),  (7)

exist finitely up to any order required. Provided that for all θ ∈ 𝕉 and all y in the range of Y,

q_2(y; θ) > 0,  (8)

it follows that L_n(β̃) = n^{-1} Σ_{i=1}^n Q{Y_i, F^{-1}(X̃_i^T β̃)} in (6) is convex in β̃. In that case, if convex penalties are used in (6), then ℓ_n(β̃) is necessarily convex in β̃, and hence the local minimizer β̃_E is the unique global penalized Bregman divergence estimator. For nonconvex penalties, however, the local minimizer may not be globally unique.
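
To make the estimator concrete, the following minimal sketch (ours, not the authors' implementation; the step size, iteration count and toy data are illustrative) minimizes (6) by proximal gradient descent for the Bernoulli deviance loss with the logit link and the L1-penalty, whose proximal map is soft-thresholding:

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def penalized_bd_logit(X, y, lam, step=0.1, iters=5000):
        # minimize (1/n) sum_i Q{y_i, F^{-1}(b0 + x_i' beta)} + lam * sum_j |beta_j|,
        # with Q the deviance loss, F the logit link and the intercept b0 unpenalized
        n, p = X.shape
        Xt = np.column_stack([np.ones(n), X])
        beta = np.zeros(p + 1)
        for _ in range(iters):
            mu = sigmoid(Xt @ beta)                        # current fitted means
            beta = beta - step * (Xt.T @ (mu - y) / n)     # gradient step on the Q-loss part
            # soft-thresholding: proximal map of the L1-penalty (intercept skipped)
            beta[1:] = np.sign(beta[1:]) * np.maximum(np.abs(beta[1:]) - step * lam, 0.0)
        return beta

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    truth = np.concatenate([[1.5, -2.0], np.zeros(8)])
    y = rng.binomial(1, sigmoid(0.5 + X @ truth))
    print(penalized_bd_logit(X, y, lam=0.05))

For nonconvex penalties, the same scheme can be run with the soft-thresholding step replaced by the corresponding proximal update, but, as noted above, only a local minimizer is then guaranteed.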

3. Penalized Bregman divergence with nonconvex penalties: p_n ≪ n

3.1. Consistency

We start by introducing some notation. Let s_n denote the number of nonzero coordinates of β_0, and set β̃_0 = (b_{0;0}, β_0^T)^T. Define

a_n = max_{j=1,…,s_n} |P′_{λ_n}(|β_{j;0}|)|,  b_n = max_{j=1,…,s_n} |P″_{λ_n}(|β_{j;0}|)|,

where P_λ^{(j)}(|β|) is shorthand for (d^j/dx^j) P_λ(x)|_{x=|β|}, j = 1, 2. Unless otherwise stated, ‖·‖ denotes the L_2-norm. Theorem 1 guarantees the existence of a consistent local minimizer of (6), and states that the local penalized Bregman divergence estimator β̃_E is (n/p_n)^{1/2}-consistent.

Theorem 1 (Existence and consistency). Assume Condition A in the Appendix, a_n = O(1/n^{1/2}) and b_n = o(1). If p_n^4/n → 0, (p_n/n)^{1/2}/λ_n → 0 and min_{j=1,…,s_n} |β_{j;0}|/λ_n → ∞ as n → ∞, then there exists a local minimizer β̃_E of (6) such that ‖β̃_E − β̃_0‖ = O_P{(p_n/n)^{1/2}}.

3.2. Oracle property

Following Theorem 1, the oracle property of the local minimizer is given in Theorem 2 below. Before stating it, we need some notation. Write X = (X^{(I)T}, X^{(II)T})^T, X̃^{(I)} = (1, X^{(I)T})^T and β̃^{(I)} = (b_0, β^{(I)T})^T. For the penalty term, let

d_n = {0, P′_{λ_n}(|β_{1;0}|) sign(β_{1;0}), …, P′_{λ_n}(|β_{s_n;0}|) sign(β_{s_n;0})}^T,  Σ_n = diag{0, P″_{λ_n}(|β_{1;0}|), …, P″_{λ_n}(|β_{s_n;0}|)}.

For the q function, define F_n = q^{(2)}{m(X)}/[F^{(1)}{m(X)}]^2 X̃^{(I)} X̃^{(I)T} and

Ω_n = E[var(Y | X) q^{(2)}{m(X)} F_n],  H_n = −E(F_n).

Theorem 2 (Oracle property). Assume Condition B in the Appendix.

  1. If p_n^2/n = O(1), p_n/(n^{1/2}λ_n) → 0 and lim inf_{n→∞} lim inf_{θ→0+} P′_{λ_n}(θ)/λ_n > 0 as n → ∞, then any (n/p_n)^{1/2}-consistent local minimizer β̃_E = {β̃_E^{(I)T}, β̂^{(II)T}}^T satisfies pr(β̂^{(II)} = 0) → 1.

  2. Moreover, if a_n = O(1/n^{1/2}), p_n^5/n → 0 and min_{j=1,…,s_n} |β_{j;0}|/λ_n → ∞, then for any fixed integer k and any k × (s_n + 1) matrix A_n such that A_n A_n^T → G with G a k × k nonnegative-definite symmetric matrix, n^{1/2} A_n Ω_n^{-1/2} {(H_n + Σ_n)(β̃_E^{(I)} − β̃_0^{(I)}) + d_n} → N(0, G) in distribution.

Theorem 2 has some useful consequences: First, the pn-dimensional penalized Bregman divergence estimator, combined with appropriate penalties, achieves the same oracle property as the penalized likelihood estimator of Fan & Peng (2004): the estimators of the zero parameters take exactly zero values with probability tending to 1, and the estimators of the nonzero parameters are asymptotically normal with the same means and variances as if the zero coefficients were known in advance. Second, the asymptotic distribution of the penalized Bregman divergence estimator relies on the underlying distribution of Y | X through E(Y | X) and var(Y | X), but does not require a complete specification of the underlying distribution. Third, the asymptotic distribution depends on the choice of the Q-loss only through the second derivative of its generating q function. This enables us to evaluate the impact of loss functions on the penalized Bregman divergence estimators and to derive an optimal loss function in certain situations.

According to Theorem 2, the asymptotic covariance matrix of β̃_E^{(I)} is V_n = (H_n + Σ_n)^{-1} Ω_n (H_n + Σ_n)^{-1}. In practice, V_n is unknown and needs to be estimated. Typically, the sandwich formula can be exploited to form an estimator of V_n,

V̂_n = (Ĥ_n + Σ̂_n)^{-1} Ω̂_n (Ĥ_n + Σ̂_n)^{-1},  (9)

where Ω̂_n = n^{-1} Σ_{i=1}^n q_1²(Y_i; X̃_i^{(I)T} β̃_E^{(I)}) X̃_i^{(I)} X̃_i^{(I)T}, Ĥ_n = n^{-1} Σ_{i=1}^n q_2(Y_i; X̃_i^{(I)T} β̃_E^{(I)}) X̃_i^{(I)} X̃_i^{(I)T} and Σ̂_n = diag{0, P″_{λ_n}(|β̂_1|), …, P″_{λ_n}(|β̂_{s_n}|)}.
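
For instance, a minimal sketch of (9) for the Bernoulli deviance loss with the logit link, for which q_1(y; θ) = μ − y and q_2(y; θ) = μ(1 − μ) with μ = F^{-1}(θ), could read as follows (our illustration; X_sel holds the selected covariates, beta_hat stacks (b̂_0, β̂^{(I)}) and d2pen holds the values P″_{λ_n}(|β̂_j|)):

    import numpy as np

    def sandwich_cov(X_sel, y, beta_hat, d2pen):
        n = X_sel.shape[0]
        Xt = np.column_stack([np.ones(n), X_sel])        # tilde X^{(I)}
        mu = 1.0 / (1.0 + np.exp(-(Xt @ beta_hat)))      # fitted means
        q1 = mu - y                                      # q_1 at the fitted values
        q2 = mu * (1.0 - mu)                             # q_2 at the fitted values
        Omega = (Xt * (q1 ** 2)[:, None]).T @ Xt / n     # hat Omega_n
        H = (Xt * q2[:, None]).T @ Xt / n                # hat H_n
        Sigma = np.diag(np.concatenate([[0.0], d2pen]))  # hat Sigma_n
        HS_inv = np.linalg.inv(H + Sigma)
        return HS_inv @ Omega @ HS_inv                   # hat V_n in (9)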

Proposition 1 below demonstrates that for any (n/p_n)^{1/2}-consistent estimator β̃_E^{(I)} of β̃_0^{(I)}, V̂_n is a consistent estimator of the covariance matrix V_n, in the sense that A_n(V̂_n − V_n)A_n^T → 0 in probability for any k × (s_n + 1) matrix A_n satisfying A_nA_n^T → G, where k is any fixed integer.

Proposition 1 (Covariance matrix estimation). Assume Condition B in the Appendix, and b_n = o(1). If p_n^4/n → 0, (p_n/n)^{1/2}/λ_n → 0 and min_{j=1,…,s_n} |β_{j;0}|/λ_n → ∞ as n → ∞, then for any β̃_E^{(I)} with ‖β̃_E^{(I)} − β̃_0^{(I)}‖ = O_P{(p_n/n)^{1/2}}, we have that A_n(V̂_n − V_n)A_n^T → 0 in probability for any k × (s_n + 1) matrix A_n satisfying A_nA_n^T → G, where G is a k × k matrix.

Is there an optimal choice of q such that the corresponding V_n matrix achieves its lower bound? We have that V_n = H_n^{-1} Ω_n H_n^{-1} in two special cases. One is Σ_n = 0 for large n and large min_{j=1,…,s_n} |β_{j;0}|, which results from the smoothly clipped absolute deviation and hard thresholding penalties; the other is Σ_n = 0 for all n, which results from the weighted L1-penalties in Theorem 6 below. In these cases, it can be shown via matrix algebra that the optimal q satisfies the generalized Bartlett identity (11) below. On the other hand, for an arbitrary Σ_n ≠ 0, complications arise and the optimal q is generally not available in closed form.

3.3. Hypothesis testing

We consider hypothesis testing about β̃_0^{(I)}, formulated as

H_0: A_n β̃_0^{(I)} = 0  versus  H_1: A_n β̃_0^{(I)} ≠ 0,  (10)

where A_n is a given k × (s_n + 1) matrix such that A_nA_n^T = G with G a k × k positive-definite matrix. This form of linear hypothesis allows one to test simultaneously whether a subset of the variables is statistically significant, by taking a specific form of the matrix A_n; for example, A_n = [I_k, 0_{k×(s_n+1−k)}] yields A_nA_n^T = I_k.

We propose a generalized Wald-type test statistic of the form

W_n = n (A_n β̃_E^{(I)})^T (A_n Ĥ_n^{-1} Ω̂_n Ĥ_n^{-1} A_n^T)^{-1} (A_n β̃_E^{(I)}),

where Ω̂_n and Ĥ_n are as defined in (9). The test is asymptotically distribution-free, as Theorem 3 shows that, under the null hypothesis, W_n is asymptotically distributed as χ²_k.
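
In practice W_n is easy to compute; a minimal sketch (ours; the inputs are assumed to be the fitted quantities from (9) and a user-supplied contrast matrix A with k rows) is:

    import numpy as np
    from scipy.stats import chi2

    def wald_test(n, beta_I_hat, Omega_hat, H_hat, A):
        # W_n = n (A b)' (A H^{-1} Omega H^{-1} A')^{-1} (A b); p-value from chi^2_k
        Hinv = np.linalg.inv(H_hat)
        middle = A @ Hinv @ Omega_hat @ Hinv @ A.T
        Ab = A @ beta_I_hat
        W = float(n * Ab @ np.linalg.solve(middle, Ab))
        return W, chi2.sf(W, df=A.shape[0])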

Theorem 3 (Wald-type test under H_0). Assume Condition C in the Appendix, a_n = o{1/(ns_n)^{1/2}} and b_n = o(1/p_n^{1/2}). If p_n^5/n → 0, (p_n/n)^{1/2}/λ_n → 0 and min_{j=1,…,s_n} |β_{j;0}|/λ_n → ∞ as n → ∞, then under H_0 in (10), W_n → χ²_k in distribution.

Remark 1. To appreciate the discriminating power of W_n in assessing significance, its asymptotic power can be analysed. It can be shown that under H_1 in (10), with ‖A_nβ̃_0^{(I)}‖ independent of n, W_n → +∞ in probability at the rate n; hence the power of W_n tends to 1 against fixed alternatives. Besides, W_n has nontrivial local power for detecting contiguous alternatives approaching the null at the rate n^{-1/2}. We omit the lengthy details.

In the context of the penalized likelihood estimator β̃_E, Fan & Peng (2004) showed that the likelihood-ratio-type test statistic

Λ_n = 2n { min_{β̃ ∈ 𝕉^{p_n+1}: A_n β̃^{(I)} = 0} ℓ_n(β̃) − ℓ_n(β̃_E) }

follows an asymptotic χ² distribution under the null hypothesis. Theorem 4 below explores the extent to which this result can be extended to Λ_n constructed from the broad class of penalized Bregman divergence estimators.

Theorem 4 (Likelihood-ratio-type test under H_0). Assume (8) and Condition D in the Appendix, a_n = o{1/(ns_n)^{1/2}} and b_n = o(1/p_n^{1/2}). If p_n^5/n → 0, (p_n/n)^{1/2}/λ_n → 0 and min_{j=1,…,s_n} |β_{j;0}|/λ_n → ∞ as n → ∞, then under H_0 in (10), provided that q satisfies the generalized Bartlett identity,

q^{(2)}{m(·)} = −c/var(Y | X = ·),  (11)

for a constant c > 0, we have that Λ_n/c → χ²_k in distribution.

Curiously, the result in Theorem 4 indicates that in general, condition (11) on q restricts the application domain of the test statistic Λn. For instance, in the case of binary responses, the Bernoulli deviance loss satisfies (11), but the quadratic loss and exponential loss violate (11). This limitation reflects that the likelihood-ratio-type test statistic Λn may not be straightforwardly valid for the penalized Bregman divergence estimators.

Remark 2. For a Bregman divergence Q, condition (11) with c = 1 is equivalent to the equality E[∂²Q{Y, m(·)}/∂m(·)² | X = ·] = E([∂Q{Y, m(·)}/∂m(·)]² | X = ·), which includes the Bartlett identity (Bartlett, 1953) as a special case when Q is the negative loglikelihood. Thus we call (11) the generalized Bartlett identity. It is also seen that the quadratic loss satisfies (11) for homoscedastic regression models, even without knowledge of the error distribution.
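
These assertions can be checked directly from the generating functions listed in § 2.1. For the Bernoulli deviance, q(μ) = −{μ log(μ) + (1 − μ) log(1 − μ)} gives

q^{(2)}(μ) = −1/{μ(1 − μ)} = −1/var(Y | X),

so (11) holds with c = 1. For the quadratic loss with homoscedastic errors, var(Y | X) = σ², we have q^{(2)}(μ) = −2 = −c/σ² with c = 2σ². For the exponential loss, q(μ) = 2{μ(1 − μ)}^{1/2} gives q^{(2)}(μ) = −(1/2){μ(1 − μ)}^{-3/2}, which is not of the form −c/{μ(1 − μ)}, so (11) fails for binary responses.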

4. Penalized Bregman divergence with convex penalties: p_n ≈ n

4.1. Consistency, oracle property and hypothesis testing

For the nonconvex penalties discussed in § 3, the condition p_n^4/n → 0 or p_n^5/n → 0 can be relaxed to p_n^3/n → 0 in the particular situation where the Bregman divergence is the quadratic loss and the link is the identity link. It remains unclear whether the condition on p_n can be relaxed in other cases.

This section aims to improve the rate of consistency of the penalized Bregman divergence estimators and to relax the conditions on p_n using certain convex penalties, the weighted L1-penalties, under which the penalized Bregman divergence estimator β̃_E = (b̂_0, β̂^T)^T is defined to minimize the criterion function,

ℓ_n(β̃) = (1/n) Σ_{i=1}^n Q{Y_i, F^{-1}(X̃_i^T β̃)} + λ_n Σ_{j=1}^{p_n} w_j |β_j|,  (12)

with w_1, …, w_{p_n} representing nonnegative weights. Define

w_max^{(I)} = max_{j=1,…,s_n} w_j,  w_min^{(II)} = min_{s_n+1 ⩽ j ⩽ p_n} w_j.

Lemma 1 establishes the existence of an (n/p_n)^{1/2}-consistent local minimizer of (12). This rate is identical to that in Theorem 1 but, unlike Theorem 1, Lemma 1 covers the L1-penalty. Other results parallel to those in § 3 can be obtained similarly.

Lemma 1 (Existence and consistency). Assume Conditions A1–A7 in the Appendix and w_max^{(I)} = O_P{1/(λ_n n^{1/2})}. If p_n^4/n → 0 as n → ∞, then there exists a local minimizer β̃_E of (12) such that ‖β̃_E − β̃_0‖ = O_P{(p_n/n)^{1/2}}.

Lemma 1 imposes a condition on the weights of the nonzero coefficients alone, and ignores the weights on the zero coefficients. Theorem 5 below shows that incorporating appropriate weights for the zero coefficients can improve the rate of consistency from (p_n/n)^{1/2} to (s_n/n)^{1/2}.

Theorem 5 (Existence and consistency). Assume Conditions A1–A7 in the Appendix, w_max^{(I)} = O_P{1/(λ_n n^{1/2})}, and that there exists a constant M ∈ (0, ∞) such that lim_{n→∞} pr(w_min^{(II)} λ_n > M) = 1. If s_n^4/n → 0 and s_n(p_n − s_n) = o(n), then there exists a local minimizer β̃_E of (12) such that ‖β̃_E − β̃_0‖ = O_P{(s_n/n)^{1/2}}.

More importantly, the conditions on the dimension p_n are much relaxed. For example, Theorem 5 allows p_n = o(n^{(3+δ)/(4+δ)}) for any δ > 0, provided that s_n = O(n^{1/(4+δ)}), whereas Theorem 1 requires p_n = o(n^{1/4}) for any s_n ⩽ p_n. This implies that p_n can indeed be relaxed to case (2), where p_n is nearly comparable with n. On the other hand, the proof of Theorem 5 relies on the flexibility of the weights {w_j}, as seen in the I_{2,1}^{(II)} term; directly carrying the proof of Theorem 5 over to either the nonconvex penalties in Theorem 1 or the L1-penalty is therefore not feasible.

Theorem 6 gives an oracle property for the (n/sn)1/2-consistent local minimizer.

Theorem 6 (Oracle property). Assume Conditions A1, A2, B3, A4, B5 and A6–A7 in the Appendix.

  1. If s_n^2/n = O(1) and w_min^{(II)} λ_n n^{1/2}/(s_n p_n)^{1/2} → ∞ in probability as n → ∞, then any (n/s_n)^{1/2}-consistent local minimizer β̃_E = {β̃_E^{(I)T}, β̂^{(II)T}}^T satisfies pr(β̂^{(II)} = 0) → 1.

  2. Moreover, if w_max^{(I)} = O_P{1/(λ_n n^{1/2})}, s_n^5/n → 0 and min_{j=1,…,s_n} |β_{j;0}|/(s_n/n)^{1/2} → ∞, then for any fixed integer k and any k × (s_n + 1) matrix A_n such that A_nA_n^T → G with G a k × k nonnegative-definite symmetric matrix, n^{1/2} A_n Ω_n^{-1/2} {H_n(β̃_E^{(I)} − β̃_0^{(I)}) + λ_n W_n sign(β̃_0^{(I)})} → N(0, G) in distribution, where W_n = diag(0, w_1, …, w_{s_n}) and sign(β̃_0^{(I)}) = {sign(b_{0;0}), sign(β_{1;0}), …, sign(β_{s_n;0})}^T.

For testing hypotheses of the form (10), the generalized Wald-type test statistic Wn proposed in § 3.3 continues to be applicable. Theorem 7 derives the asymptotic distribution of Wn.

Theorem 7 (Wald-type test under H_0). Assume Conditions A1, A2, B3, C4, B5 and A6–A7 in the Appendix, and that w_max^{(I)} = o_P[1/{λ_n(ns_n)^{1/2}}]. If s_n^5/n → 0 and min_{j=1,…,s_n} |β_{j;0}|/(s_n/n)^{1/2} → ∞ as n → ∞, then under H_0 in (10), W_n → χ²_k in distribution.

4.2. Weight selection

We propose a penalized componentwise regression method that selects the weights as

ŵ_j = |β̂_j^{PCR}|^{-1}  (j = 1, …, p_n),  (13)

based on an initial estimator β̂^{PCR} = (β̂_1^{PCR}, …, β̂_{p_n}^{PCR})^T minimizing

ℓ_n^{PCR}(β) = (1/n) Σ_{i=1}^n Σ_{j=1}^{p_n} Q{Y_i, F^{-1}(X_{ij} β_j)} + κ_n Σ_{j=1}^{p_n} |β_j|,  (14)

with some sequence κn > 0. Theorem 8 indicates that under assumptions on the correlation between the predictor variables and the response variable, the weights selected by the penalized componentwise regression satisfy the conditions in Theorem 5.

Theorem 8 (Penalized componentwise regression for weights: p_n ≈ n). Assume Conditions A1, A2, B3, A4, A6, A7 and E. Assume that in Condition E, 𝒜_n = λ_n n^{1/2}, 𝒜_n/κ_n → ∞ and 𝒝_n/κ_n = O(1) for κ_n in (14). Suppose that λ_n n^{1/2} = O(1), λ_n = o(κ_n) and log(p_n) = o(nκ_n²). Assume that E(X) = 0 in model (4). Then there exist local minimizers β̂_j^{PCR} (j = 1, …, p_n) of (14) such that the weights ŵ_j (j = 1, …, p_n) defined in (13) satisfy ŵ_max^{(I)} = O_P{1/(λ_n n^{1/2})} and ŵ_min^{(II)} λ_n → ∞ in probability, as required in Theorem 5, where ŵ_max^{(I)} = max_{j=1,…,s_n} ŵ_j and ŵ_min^{(II)} = min_{s_n+1 ⩽ j ⩽ p_n} ŵ_j.
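
As an illustration of (13)–(14) (ours, not the authors' code), the special case of the quadratic loss with the identity link gives each one-dimensional problem in (14) a closed-form soft-threshold solution; for other Q-losses the same componentwise scheme applies with a one-dimensional numerical minimization in place of the closed form:

    import numpy as np

    def pcr_weights(X, y, kappa, eps=1e-10):
        # componentwise lasso: minimize (1/n) sum_i (y_i - x_{ij} b)^2 + kappa |b| for each j
        n = X.shape[0]
        c = X.T @ y / n                          # n^{-1} sum_i X_{ij} Y_i
        a = (X ** 2).sum(axis=0) / n             # n^{-1} sum_i X_{ij}^2
        beta_pcr = np.sign(c) * np.maximum(np.abs(c) - kappa / 2.0, 0.0) / a
        # w_j = |beta_j^PCR|^{-1}; a zero estimate gives an effectively infinite weight,
        # capped here at 1/eps
        return 1.0 / np.maximum(np.abs(beta_pcr), eps)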

5. Consistency of the penalized Bregman divergence classifier

This section deals with a binary response variable Y, which takes values 0 and 1. In this case, the mean regression function m(x) in (4) becomes the class label probability, pr(Y = 1 | X = x). From the penalized Bregman divergence estimator (b̂_0, β̂^T)^T proposed in either § 3 or § 4, we can construct the penalized Bregman divergence classifier, ϕ̂(x) = I{m̂(x) > 1/2}, for a future input variable x, where m̂(x) = F^{-1}(b̂_0 + x^T β̂).
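
For the logit link, for example, the induced rule can be written in two lines (a small sketch of ours; the inputs are the penalized estimates):

    import numpy as np

    def bd_classify(X_new, b0_hat, beta_hat):
        m_hat = 1.0 / (1.0 + np.exp(-(b0_hat + X_new @ beta_hat)))  # F^{-1} is the logistic function
        return (m_hat > 0.5).astype(int)                            # phi_hat(x) = I{m_hat(x) > 1/2}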

In the classification literature, the misclassification loss of a classification rule ϕ at a sample point (x, y) is l{y, ϕ(x)} = I{y ≠ ϕ(x)}. The risk of ϕ is the expected misclassification loss, R(ϕ) = E[l{Y, ϕ(X)}] = pr{ϕ(X) ≠ Y}. The optimal Bayes rule, which minimizes the risk over ϕ, is ϕ_B(x) = I{m(x) > 1/2}. For a test sample (X_o, Y_o), an independent and identically distributed copy of the samples in the training set 𝒯_n = {(X_i, Y_i), i = 1, …, n}, the optimal Bayes risk is R(ϕ_B) = pr{ϕ_B(X_o) ≠ Y_o}, and the conditional risk of the penalized Bregman divergence classification rule ϕ̂ is R(ϕ̂) = pr{ϕ̂(X_o) ≠ Y_o | 𝒯_n}. For ϕ̂ induced by penalized Bregman divergence regression estimation with a range of loss functions combined with either the smoothly clipped absolute deviation, L1 or weighted L1-penalties, Theorem 9 establishes the classification consistency of ϕ̂.

Theorem 9 (Consistency of the penalized Bregman divergence classifier). Assume Conditions A1 and A4 in the Appendix. Suppose that ‖β̃_E − β̃_0‖ = O_P(r_n). If r_n p_n^{1/2} = o(1), then the classification rule ϕ̂ constructed from β̃_E is consistent, in the sense that E{R(ϕ̂)} − R(ϕ_B) → 0 as n → ∞.

6. Simulation study

6.1. Set-up

For illustrative purposes, four procedures for penalized estimation are compared: (I) the smoothly clipped absolute deviation penalty, with accompanying parameter a = 3.7, combined with the local linear approximation; (II) the L1-penalty; (III) the weighted L1-penalties with weights selected by (13); and (IV) the oracle estimator using the set of significant variables. Throughout the numerical work in the paper, methods (I)–(III) use the least angle regression algorithm, and F is the log link for count data and the logit link for binary response variables.

6.2. Penalized quasilikelihood for overdispersed count data

A quasilikelihood function Q relaxes the distributional assumption on a random variable Y via the specification ∂Q(Y, μ)/∂μ = −(Y − μ)/V(μ), where var(Y | X = x) = V{m(x)} for a known continuous function V(·) > 0. Zhang et al. (2009) verified that the quasilikelihood function belongs to the Bregman divergence class and derived the generating q function,

q(μ) = ∫^μ {(s − μ)/V(s)} ds.  (15)
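
As a worked special case (with the lower limit of integration left unspecified, since affine terms in q do not affect Q), the Poisson-type variance function V(s) = s gives

q(μ) = ∫^μ (s − μ)/s ds = μ − μ log(μ) + affine terms,  q^{(2)}(μ) = −1/μ = −1/V(μ),

and substituting into (3), with the affine terms cancelling, yields Q(y, μ) = y log(y/μ) − (y − μ), one half of the Poisson deviance.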

We generate overdispersed Poisson counts Y_i satisfying var(Y_i | X_i = x_i) = 2m(x_i). In the predictor X_i = (X_{i1}, …, X_{ip_n})^T, p_n = n/8, n/2 and n − 10, and X_{i1} = i/n − 0.5. For j = 2, …, p_n, X_{ij} = Φ(Z_{ij}) − 0.5, where Φ is the standard normal distribution function and (Z_{i2}, …, Z_{ip_n})^T ∼ N{0, ρ 1_{p_n−1} 1_{p_n−1}^T + (1 − ρ) I_{p_n−1}}, with 1_d a d × 1 vector of ones and I_d a d × d identity matrix. Thus (X_{i2}, …, X_{ip_n}) are correlated, shifted Un(0, 1) variables if ρ ≠ 0. The link function is the log link, log{m(x)} = b_{0;0} + x^T β_0, where b_{0;0} = 5 and β_0 = (2, 2, 0, 0, …, 0)^T.
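
A sketch of this design in Python is given below (ours; the covariate construction follows the description above, whereas the overdispersed counts are drawn here from a negative binomial parameterization with variance twice the mean, which is one convenient choice and not necessarily the authors'):

    import numpy as np
    from scipy.stats import norm

    def simulate_overdispersed(n, rho=0.2, seed=0):
        rng = np.random.default_rng(seed)
        pn = n // 8                                    # e.g. p_n = n/8
        X = np.empty((n, pn))
        X[:, 0] = np.arange(1, n + 1) / n - 0.5        # X_{i1} = i/n - 0.5
        cov = rho + (1.0 - rho) * np.eye(pn - 1)       # rho 11' + (1 - rho) I
        Z = rng.multivariate_normal(np.zeros(pn - 1), cov, size=n)
        X[:, 1:] = norm.cdf(Z) - 0.5                   # X_{ij} = Phi(Z_{ij}) - 0.5
        beta0 = np.zeros(pn)
        beta0[:2] = 2.0                                # beta_0 = (2, 2, 0, ..., 0)'
        m = np.exp(5.0 + X @ beta0)                    # log link with b_{0;0} = 5
        y = rng.negative_binomial(m, 0.5)              # mean m, variance 2m
        return X, y, m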

First, to examine the effect of penalized regression estimation on model fitting, we generate 200 training sets of size n. For each training set, the model error is calculated as Σ_{l=1}^L {m̂(x_l) − m(x_l)}²/L at a randomly generated sequence {x_l}_{l=1}^L with L = 5000, and the relative model error is the ratio of the model error of a penalized estimator to that of the nonpenalized estimator. The tuning constant λ_n for the training set in each simulation for methods (I)–(II) is selected by minimizing the quasilikelihood loss on a test set of the same size as the training set; λ_n and κ_n for method (III) are searched over a two-dimensional grid of points. The mean relative model error is then obtained from the 200 training sets. Table 1 summarizes the penalized quasilikelihood estimates of the parameters based on (15). It is clearly seen that when the true model coefficients are sparse, the penalized estimators reduce the function estimation error relative to the nonpenalized estimators.

Table 1.

Simulation results from penalized quasilikelihood estimates, with dependent predictors. n = 200, ρ = 0.2

Loss             p_n     Method       MRME     CZ (sd)          IZ (sd)
Quasilikelihood  n/8     SCAD         0.2428   17.74 (5.46)     0 (0)
                         L1           0.3503   14.21 (4.91)     0 (0)
                         Weighted L1  0.1077   21.32 (2.48)     0 (0)
                         Oracle       0.0861   23
Quasilikelihood  n/2     SCAD         0.0409   91.73 (12.56)    0 (0)
                         L1           0.0712   88.00 (14.94)    0 (0)
                         Weighted L1  0.0161   94.84 (5.89)     0 (0)
                         Oracle       0.0105   98
Quasilikelihood  n − 10  SCAD         0.0010   184.37 (8.52)    0 (0)
                         L1           0.0019   181.13 (13.87)   0 (0)
                         Weighted L1  0.0004   185.25 (4.97)    0 (0)
                         Oracle       0.0002   188

SCAD, smoothly clipped absolute deviation; MRME, mean of relative model errors obtained from the training sets; CZ, average number of coefficients that are correctly estimated to be zero when the true coefficients are zero; IZ, average number of coefficients that are incorrectly estimated to be zero when the true coefficients are nonzero; sd, standard deviation.

Second, to study the utility of the penalized estimators for variable selection under quasilikelihood, Table 1 also gives the average number of coefficients correctly estimated to be zero when the true coefficients are zero, and the average number of coefficients incorrectly estimated to be zero when the true coefficients are nonzero, with the corresponding standard deviations across the 200 training sets given in brackets. Overall, the penalized estimators help to yield a sparse solution and build a sparse model. These results lend support to the theoretical results in § 3 and § 4.

In summary, the smoothly clipped absolute deviation and weighted L1 penalties outperform the L1 penalty in terms of regression estimation and variable selection. As expected, the oracle estimator, which is practically infeasible, performs better than the three penalized estimators.

6.3. Penalized Bregman divergence for binary classification

We generate two-class data from the model,

X = (X_1, …, X_{p_n})^T ∼ N(0, Σ),  Y | X = x ∼ Ber{m(x)},

where p_n = n/8, n/2, n − 10, Σ = ρ 1_{p_n} 1_{p_n}^T + (1 − ρ) I_{p_n} and logit{m(x)} = b_{0;0} + x^T β_0 with b_{0;0} = 3 and β_0 = (1.5, 2, −2, −2.5, 0, 0, …, 0)^T. Table 2 summarizes the penalized estimates of the parameters. The results reinforce the conclusions drawn in § 6.2.

Table 2.

Simulation results from penalized Bregman divergence estimates for binary classification, with dependent predictors. n = 200, ρ = 0.2

Loss         p_n     Method       MRME     CZ (sd)          IZ (sd)       MAMR
Deviance     n/8     SCAD         0.2504   18.86 (4.37)     0.01 (0.10)   0.1153
                     L1           0.3774   11.31 (5.48)     0.00 (0.00)   0.1218
                     Weighted L1  0.2409   18.11 (2.26)     0.01 (0.10)   0.1160
                     Oracle       0.1164   21               0             0.1042
Exponential  n/8     SCAD         0.2566   18.92 (4.13)     0.00 (0.00)   0.1162
                     L1           0.3356   12.28 (5.54)     0.00 (0.00)   0.1232
                     Weighted L1  0.2176   19.07 (1.66)     0.01 (0.10)   0.1175
                     Oracle       0.1276   21               0             0.1042
Deviance     n/2     SCAD         0.0612   94.74 (2.32)     0.03 (0.17)   0.1166
                     L1           0.1148   76.39 (12.97)    0.00 (0.00)   0.1313
                     Weighted L1  0.0782   89.00 (6.38)     0.04 (0.19)   0.1235
                     Oracle       0.0240   96               0             0.1043
Exponential  n/2     SCAD         0.0915   94.37 (2.91)     0.05 (0.21)   0.1209
                     L1           0.1141   76.05 (11.99)    0.00 (0.00)   0.1315
                     Weighted L1  0.0723   90.60 (4.70)     0.04 (0.19)   0.1222
                     Oracle       0.0310   96               0             0.1043
Deviance     n − 10  SCAD         0.0230   185.09 (1.53)    0.02 (0.14)   0.1136
                     L1           0.0847   158.19 (17.26)   0.00 (0.00)   0.1401
                     Weighted L1  0.0539   176.51 (8.17)    0.03 (0.17)   0.1273
                     Oracle       0.0121   186              0             0.1044
Exponential  n − 10  SCAD         0.0360   184.62 (2.20)    0.01 (0.10)   0.1170
                     L1           0.0746   161.15 (14.73)   0.00 (0.00)   0.1386
                     Weighted L1  0.0489   178.70 (5.91)    0.04 (0.19)   0.1271
                     Oracle       0.0150   186              0             0.1044

MAMR, mean of the average misclassification rates calculated from training sets.

Moreover, to investigate the performance of penalized classifiers, we evaluate the average misclassification rate for 10 independent test sets of size 10 000. Table 2 reports the mean of the average misclassification rates calculated from 100 training sets. Evidently, all penalized classifiers perform as well as the optimal Bayes classifier. This agrees with results of Theorem 9 on the asymptotic classification consistency. Furthermore, the choice of loss functions in the penalized classifiers has an asymptotically relatively negligible impact on classification performance.

7. Real data

The Arrhythmia dataset (Güvenir et al., 1997) consists of 452 patient records used in the diagnosis of cardiac arrhythmia. Each record contains 279 clinical measurements, from electrocardiography signals and other information such as sex, age and weight, along with the decision of an expert cardiologist. In the data, class 01 refers to normal electrocardiography, classes 02–15 each refer to a particular type of arrhythmia, and class 16 refers to the unclassified remainder.

We intend to predict whether or not a patient can be categorized as having normal electrocardiography. After deleting missing values and class 16, the remaining 430 patients with 257 attributes are used in the classification. To evaluate the performance of the penalized estimates of the model parameters in logit{pr(Y = 1 | X_1, …, X_{257})} = b_0 + Σ_{j=1}^{257} β_j X_j, we randomly split the data into a training set and a test set in the ratio 2:1. For each training set, the tuning constant is selected by minimizing a 3-fold crossvalidated estimate of the misclassification rate; λ_n and κ_n for the penalized componentwise regression are found on a grid of points. We calculate the mean of the misclassification rates and the average number of selected variables over 100 random splits. Table 3 shows that the penalized classifier using the deviance loss and that using the exponential loss have similar misclassification rates. In contrast, the nonpenalized classifiers retain all attributes and yield much higher misclassification rates.
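
The evaluation protocol can be mimicked with standard software; the sketch below (ours, using scikit-learn's L1-penalized logistic regression as a stand-in for the penalized deviance-loss classifier of method (II); X and y are assumed to hold the 257 attributes and the binary labels) reproduces one random split:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split

    def one_split(X, y, seed=0):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=seed)
        grid = GridSearchCV(
            LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000),
            param_grid={"C": np.logspace(-2, 2, 20)},  # C is an inverse regularization strength
            scoring="accuracy", cv=3)                  # 3-fold crossvalidation on the training set
        grid.fit(X_tr, y_tr)
        mis_rate = 1.0 - grid.score(X_te, y_te)
        n_selected = int(np.count_nonzero(grid.best_estimator_.coef_))
        return mis_rate, n_selected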

Table 3.

Arrhythmia data: mean misclassification rate and the average number of selected variables

Loss         Method        MMR      Number of selected variables
Deviance     Nonpenalized  0.4265   257.00
             SCAD          0.2550   16.13
             L1            0.2358   45.46
             Weighted L1   0.2340   26.44
Exponential  Nonpenalized  0.4323   257.00
             SCAD          0.2666   15.83
             L1            0.2397   43.79
             Weighted L1   0.2366   18.77

MMR, mean of the misclassification rates.

Acknowledgments

The authors thank the editor, associate editor and two referees for insightful comments and suggestions. The research was supported by grants from the National Science Foundation and National Institutes of Health, U.S.A.

Appendix

For a matrix M, its eigenvalues, minimum eigenvalue, maximum eigenvalue and trace are denoted by λ_j(M), λ_min(M), λ_max(M) and tr(M), respectively. Let ‖M‖ = sup_{‖x‖=1} ‖Mx‖ = {λ_max(M^TM)}^{1/2} be the matrix L_2-norm, and let ‖M‖_F = {tr(M^TM)}^{1/2} be the Frobenius norm; see Golub & Van Loan (1996) for details. Throughout the proofs, C is used as a generic finite constant.

We first impose some regularity conditions, which are not the weakest possible.

Condition A consists of the following.

  • A1. Assume sup_{n⩾1} ‖β̃_0^{(I)}‖_1 < ∞ and that ‖X‖_∞ is bounded;

  • A2. the matrix E(X̃X̃^T) exists and is nonsingular;

  • A3. assume E(Y²) < ∞;

  • A4. there is a large enough open subset of 𝕉^{p_n+1}, which contains the true parameter point β̃_0, such that F^{-1}(X̃^Tβ̃) is bounded for all β̃ in the subset;

  • A5. the eigenvalues of the matrix −E(q^{(2)}{m(X)}/[F^{(1)}{m(X)}]² X̃X̃^T) are uniformly bounded away from 0;

  • A6. the function q^{(4)}(·) is continuous and q^{(2)}(·) < 0;

  • A7. the function F(·) is a bijection, F^{(3)}(·) is continuous and F^{(1)}(·) ≠ 0; and finally

  • A8. assume P_{λ_n}(0) = 0, and that there are constants C and D such that, when θ_1, θ_2 > Cλ_n, |P″_{λ_n}(θ_1) − P″_{λ_n}(θ_2)| ⩽ D|θ_1 − θ_2|.

Condition B: These are identical to Condition A except that A3 and A5 are replaced by B3 and B5:

  • B3. there exists a constant C ∈ (0, ∞) such that E{|Y − m(X)|^j} ⩽ j!C^j for all j ⩾ 3; also, inf_{n⩾1, 1⩽j⩽p_n} E{var(Y | X)X_j²} > 0; and

  • B5. assume that λ_j(Ω_n) and λ_j(H_n) are uniformly bounded away from 0, and that ‖H_n^{-1}Ω_n‖ is bounded away from ∞.

Condition C: These are identical to Condition B except that B4 is replaced by:

  • C4. there is an open subset of 𝕉^{p_n+1} which contains the true parameter point β̃_0, such that F^{-1}(X̃^Tβ̃) is bounded for all β̃ in the subset; moreover, the subset contains the origin.

Condition D: This is identical to Condition C except that C5 is replaced by:

  • D5. assume that λ_j(H_n) are uniformly bounded away from 0 and that ‖H_n^{-1/2}Ω_n^{1/2}‖ is bounded away from ∞.

Condition E is as follows.

E1. Assume min_{j=1,…,s_n} |E(X_jY)| ≽ 𝒜_n and max_{s_n+1⩽j⩽p_n} |E(X_jY)| = o(𝒝_n) for some positive sequences 𝒜_n and 𝒝_n, where s_n ≽ t_n, for two nonnegative sequences s_n and t_n, means that there exists a constant c > 0 such that s_n ⩾ c t_n for all n ⩾ 1.

Proof of Theorem 1. Let r_n = (p_n/n)^{1/2} and ũ = (u_0, u_1, …, u_{p_n})^T ∈ 𝕉^{p_n+1}. Similarly to Fan & Peng (2004), it suffices to show that for any given ε > 0 there is a large constant C such that, for large n,

pr{inf_{‖ũ‖=C} ℓ_n(β̃_0 + r_nũ) > ℓ_n(β̃_0)} ⩾ 1 − ε.  (A1)

Define β̃_L = β̃_0 + r_nũ. To show (A1), consider

D_n(ũ) = (1/n) Σ_{i=1}^n [Q{Y_i, F^{-1}(X̃_i^Tβ̃_L)} − Q{Y_i, F^{-1}(X̃_i^Tβ̃_0)}] + Σ_{j=1}^{p_n} {P_{λ_n}(|β_{j;0} + r_nu_j|) − P_{λ_n}(|β_{j;0}|)} ≡ I_1 + I_2.  (A2)

First, we consider I_1. For μ = F^{-1}(θ), obtain q_j(y; θ) (j = 1, 2, 3) from (7). By Taylor's expansion,

I_1 = I_{1,1} + I_{1,2} + I_{1,3},  (A3)

where I_{1,1} = (r_n/n) Σ_{i=1}^n q_1(Y_i; X̃_i^Tβ̃_0) X̃_i^Tũ, I_{1,2} = {r_n²/(2n)} Σ_{i=1}^n q_2(Y_i; X̃_i^Tβ̃_0)(X̃_i^Tũ)² and I_{1,3} = {r_n³/(6n)} Σ_{i=1}^n q_3(Y_i; X̃_i^Tβ̃*)(X̃_i^Tũ)³, for β̃* located between β̃_0 and β̃_0 + r_nũ. Hence |I_{1,1}| ⩽ O_P{r_n(p_n/n)^{1/2}}‖ũ‖ and I_{1,2} = (r_n²/2) ũ^T E(−q^{(2)}{m(X)}/[F^{(1)}{m(X)}]² X̃X̃^T) ũ + O_P(r_n² p_n/n^{1/2})‖ũ‖². Conditions A1 and A4 give |I_{1,3}| ⩽ O_P(r_n³ p_n^{3/2})‖ũ‖³.

Next, we consider I_2. By Taylor's expansion, I_2 ⩾ r_n Σ_{j=1}^{s_n} P′_{λ_n}(|β_{j;0}|) sign(β_{j;0}) u_j + (r_n²/2) Σ_{j=1}^{s_n} P″_{λ_n}(|β_j*|) u_j² ≡ I_{2,1} + I_{2,2}, for β_j* between β_{j;0} and β_{j;0} + r_nu_j. Thus |I_{2,1}| ⩽ r_n a_n ‖u^{(I)}‖_1 and |I_{2,2}| ⩽ r_n² b_n ‖u^{(I)}‖² + D r_n³ ‖u^{(I)}‖³, where u^{(I)} = (u_1, …, u_{s_n})^T. Since p_n^4/n → 0, we can choose some large C such that I_{1,1}, I_{1,3}, I_{2,1} and I_{2,2} are all dominated by I_{1,2}, which is positive by Condition A5. This implies (A1).

Proof of Lemma 1. Analogously to the proof of Theorem 1, it suffices to show (A1). Note that (A2) continues to hold with I_2 = λ_n Σ_{j=1}^{p_n} w_j(|β_{j;0} + r_nu_j| − |β_{j;0}|) and I_1 unchanged. Clearly, I_2 ⩾ −λ_n r_n Σ_{j=1}^{s_n} w_j|u_j| ≡ I_{2,1}, in which |I_{2,1}| ⩽ λ_n r_n w_max^{(I)} ‖u^{(I)}‖_1. The rest of the proof resembles that of Theorem 1 and is omitted.

Proof of Theorem 5. Write ũ = {ũ^{(I)T}, u^{(II)T}}^T, where ũ^{(I)} = (u_0, u_1, …, u_{s_n})^T and u^{(II)} = (u_{s_n+1}, …, u_{p_n})^T. Following the proof of Lemma 1, it suffices to show (A1) for r_n = (s_n/n)^{1/2}.

For I_{1,1} in (A3), decompose I_{1,1} = I_{1,1}^{(I)} + I_{1,1}^{(II)} according to ũ^{(I)} and u^{(II)}. It follows that |I_{1,1}^{(I)}| ⩽ r_n O_P{(s_n/n)^{1/2}}‖ũ^{(I)}‖_2 and |I_{1,1}^{(II)}| ⩽ r_n O_P(1/n^{1/2})‖u^{(II)}‖_1.

For I_{1,2} in (A3), similarly to the proof of Theorem 1, write I_{1,2} = I_{1,2,1} + I_{1,2,2}. Define d_i = −q^{(2)}{m(X_i)}/[F^{(1)}{m(X_i)}]². This yields

I_{1,2,1} ⩾ {r_n²/(2n)} Σ_{i=1}^n d_i(X̃_i^{(I)T}ũ^{(I)})² − (r_n²/n) |Σ_{i=1}^n d_i(X̃_i^{(I)T}ũ^{(I)})(X_i^{(II)T}u^{(II)})| ≡ I_{1,2,1}^{(I)} − I_{1,2,1}^{(cross)}.

Then there exists a constant C > 0 such that I_{1,2,1}^{(I)} ⩾ C r_n²{1 + o_P(1)}‖ũ^{(I)}‖_2² and |I_{1,2,1}^{(cross)}| ⩽ O_P(r_n² s_n^{1/2})‖ũ^{(I)}‖_2‖u^{(II)}‖_1. For I_{1,2,2}, partitioning ũ into ũ^{(I)} and u^{(II)} gives

I_{1,2,2} = I_{1,2,2}^{(I)} + I_{1,2,2}^{(cross)} + I_{1,2,2}^{(II)},

where |I_{1,2,2}^{(I)}| ⩽ r_n² O_P(s_n/n^{1/2})‖ũ^{(I)}‖_2², |I_{1,2,2}^{(cross)}| ⩽ r_n² O_P{(s_n/n)^{1/2}}‖ũ^{(I)}‖_2‖u^{(II)}‖_1 and |I_{1,2,2}^{(II)}| ⩽ r_n² O_P(n^{-1/2})‖u^{(II)}‖_1².

For I_{1,3} in (A3), since s_n p_n = o(n), ‖β̃*‖_1 is bounded and thus |I_{1,3}| ⩽ O_P(r_n³)‖ũ^{(I)}‖_1³ + O_P(r_n³)‖u^{(II)}‖_1³ ≡ I_{1,3}^{(I)} + I_{1,3}^{(II)}, where I_{1,3}^{(I)} ⩽ O_P(r_n³ s_n^{3/2})‖ũ^{(I)}‖_2³ and I_{1,3}^{(II)} ⩽ O_P(r_n³)‖u^{(II)}‖_1³.

For I_2 in (A2), I_2 ⩾ −I_{2,1}^{(I)} + I_{2,1}^{(II)}, where I_{2,1}^{(I)} = λ_n r_n Σ_{j=1}^{s_n} w_j|u_j| and I_{2,1}^{(II)} = λ_n r_n Σ_{j=s_n+1}^{p_n} w_j|u_j|. Hence |I_{2,1}^{(I)}| ⩽ λ_n r_n w_max^{(I)} s_n^{1/2}‖u^{(I)}‖_2 and I_{2,1}^{(II)} ⩾ λ_n r_n w_min^{(II)}‖u^{(II)}‖_1.

It can be shown that either I_{1,2,1}^{(I)} or I_{2,1}^{(II)} dominates all other terms in the groups 𝒢_1 = (I_{1,2,2}^{(I)}, I_{1,3}^{(I)}), 𝒢_2 = (I_{1,1}^{(II)}, I_{1,2,2}^{(II)}, I_{1,3}^{(II)}, I_{1,2,1}^{(cross)}, I_{1,2,2}^{(cross)}) and 𝒢_3 = (I_{1,1}^{(I)}, I_{2,1}^{(I)}). Namely, I_{1,2,1}^{(I)} dominates 𝒢_1 and I_{2,1}^{(II)} dominates 𝒢_2. For 𝒢_3, if ‖u^{(II)}‖_1 ⩽ C/2, then 𝒢_3 is dominated by I_{1,2,1}^{(I)}, which is positive; if ‖u^{(II)}‖_1 > C/2, then 𝒢_3 is dominated by I_{2,1}^{(II)}, which is positive.

Proof of Theorem 8. Minimizing (14) is equivalent to minimizing ℓ_{n,j}^{PCR}(α) = n^{-1} Σ_{i=1}^n Q{Y_i, F^{-1}(X_{ij}α)} + κ_n|α| separately for j = 1, …, p_n. The proof is separated into two parts.

Part 1. To show that ŵ_max^{(I)} = O_P{1/(λ_n n^{1/2})}, it suffices to show that, for 𝒜_n = λ_n n^{1/2}, there exist local minimizers β̂_j^{PCR} of ℓ_{n,j}^{PCR}(α) such that lim_{δ↓0} inf_{n⩾1} pr(min_{1⩽j⩽s_n}|β̂_j^{PCR}| > 𝒜_nδ) = 1. It suffices to prove that for j = 1, …, s_n there exist some b_j with |b_j| = 2δ such that

lim_{δ↓0} inf_{n⩾1} pr[min_{1⩽j⩽s_n}{inf_{|α|⩽δ} ℓ_{n,j}^{PCR}(𝒜_nα) − ℓ_{n,j}^{PCR}(𝒜_nb_j)} > 0] = 1,  (A4)

and that there exists some large enough C_n > 0 such that

lim_{δ↓0} inf_{n⩾1} pr[min_{1⩽j⩽s_n}{inf_{|α|⩾C_n} ℓ_{n,j}^{PCR}(𝒜_nα) − ℓ_{n,j}^{PCR}(𝒜_nb_j)} > 0] = 1.  (A5)

Note that (A5) holds, since for every n ⩾ 1, as |α| → ∞, min_{1⩽j⩽s_n}{ℓ_{n,j}^{PCR}(𝒜_nα) − ℓ_{n,j}^{PCR}(𝒜_nb_j)} ⩾ κ_n𝒜_n|α| − max_{j=1,…,s_n} ℓ_{n,j}^{PCR}(𝒜_nb_j) → ∞ in probability. To prove (A4), note that |𝒜_nα| ⩽ 𝒜_nδ = O(1)δ → 0 as δ ↓ 0. By Taylor's expansion,

min_{j=1,…,s_n}{inf_{|α|⩽δ} ℓ_{n,j}^{PCR}(𝒜_nα) − ℓ_{n,j}^{PCR}(𝒜_nb_j)} ⩾ 𝒜_n min_{j=1,…,s_n} inf_{|α|⩽δ}{(α − b_j) n^{-1} Σ_{i=1}^n q_1(Y_i; 0)X_{ij}} + (𝒜_n²/2) min_{j=1,…,s_n} inf_{|α|⩽δ}{α² n^{-1} Σ_{i=1}^n q_2(Y_i; X_{ij}𝒜_nα_j*)X_{ij}² − b_j² n^{-1} Σ_{i=1}^n q_2(Y_i; X_{ij}𝒜_nb_j*)X_{ij}²} + 𝒜_n min_{1⩽j⩽s_n} inf_{|α|⩽δ}{κ_n(|α| − |b_j|)} ≡ I_1 + I_2 + I_3,

with α_j* between 0 and α, and b_j* between 0 and b_j. Let μ_0 = F^{-1}(0) and C_0 = q″(μ_0)/F′(μ_0) ≠ 0. Then

I_1 ⩾ 𝒜_n min_{j=1,…,s_n} inf_{|α|⩽δ}{C_0(α − b_j)E(YX_j)} + 𝒜_n min_{j=1,…,s_n} inf_{|α|⩽δ}[C_0(α − b_j) n^{-1} Σ_{i=1}^n {Y_iX_{ij} − E(YX_j)}] − 𝒜_n max_{1⩽j⩽s_n} sup_{|α|⩽δ}{C_0μ_0(α − b_j) n^{-1} Σ_{i=1}^n X_{ij}} ≡ I_{1,1} + I_{1,2} + I_{1,3}.

We see that |I_{1,3}| ⩽ O_P[𝒜_n{log(s_n)/n}^{1/2}]δ, by Bernstein's inequality (Lemma 2.2.11 of van der Vaart & Wellner, 1996). Similarly, |I_{1,2}| = O_P[𝒜_n{log(s_n)/n}^{1/2}]δ by an argument similar to that for Theorem 2. Choosing b_j = −2δ sign{C_0E(YX_j)}, which satisfies |b_j| = 2δ, gives I_{1,1} ⩾ |C_0|c𝒜_n²δ. For I_2 and I_3, we observe that |I_2| ⩽ O_P(𝒜_n²)δ² and |I_3| = O(𝒜_nκ_n)δ. By the assumptions, we can choose a small enough δ > 0 such that, with probability tending to 1, I_{1,2}, I_{1,3}, I_2 and I_3 are dominated by I_{1,1}, which is positive. Thus (A4) is proved.

Part 2. To verify that ŵ_min^{(II)}λ_n → ∞ in probability, it suffices to prove that, for any ε > 0, there exist local minimizers β̂_j^{PCR} of ℓ_{n,j}^{PCR}(α) such that lim_{n→∞} pr(max_{s_n+1⩽j⩽p_n}|β̂_j^{PCR}| ⩽ ελ_n) = 1. Similarly to the proof of Theorem 1, we will prove that for any ε > 0,

lim_{n→∞} pr[min_{j=s_n+1,…,p_n}{inf_{|α|=ε} ℓ_{n,j}^{PCR}(λ_nα) − ℓ_{n,j}^{PCR}(0)} > 0] = 1.  (A6)

For j = s_n + 1, …, p_n, by Taylor's expansion,

min_{j=s_n+1,…,p_n}{inf_{|α|=ε} ℓ_{n,j}^{PCR}(λ_nα) − ℓ_{n,j}^{PCR}(0)} ⩾ λ_n min_{j=s_n+1,…,p_n} inf_{|α|=ε}{α n^{-1} Σ_{i=1}^n q_1(Y_i; 0)X_{ij}} + (λ_n²/2) min_{j=s_n+1,…,p_n} inf_{|α|=ε}{α² n^{-1} Σ_{i=1}^n q_2(Y_i; X_{ij}λ_nα_j*)X_{ij}²} + λ_n inf_{|α|=ε}(κ_n|α|) ≡ I_1 + I_2 + I_3,

where α_j* is between 0 and α. Similarly to the proof in Part 1, |I_1| ⩽ O_P[λ_n{log(p_n − s_n + 1)/n}^{1/2}]ε + o(λ_n𝒝_n)ε. Note that |I_2| ⩽ O_P(λ_n²)ε² and I_3 = λ_nκ_nε. By the assumptions, with probability tending to 1, I_1 and I_2 are dominated by I_3 > 0. So (A6) is proved.

Proof of Theorem 9. We first need to show Lemma A1.

Lemma A1. Suppose that (X_o, Y_o) follows the distribution of (X, Y) and is independent of the training set 𝒯_n. If Q satisfies (3), then E[Q{Y_o, m̂(X_o)}] = E[Q{Y_o, m(X_o)}] + E[Q{m(X_o), m̂(X_o)}].

Proof. Let q be the generating function of Q. We deduce from Corollary 3, p. 223, of Chow & Teicher (1988) that E{q(Y_o) | 𝒯_n, X_o} = E{q(Y_o) | X_o} and E[Y_o q′{m̂(X_o)} | 𝒯_n, X_o] = E(Y_o | X_o) q′{m̂(X_o)} = m(X_o) q′{m̂(X_o)}.

We now show Theorem 9. Setting Q in Lemma A1 to be the misclassification loss gives

(1/2)[E{R(ϕ̂)} − R(ϕ_B)] ⩽ E[|m(X_o) − 0.5| I{m(X_o) ⩽ 0.5, m̂(X_o) > 0.5}] + E[|m(X_o) − 0.5| I{m(X_o) > 0.5, m̂(X_o) ⩽ 0.5}] ≡ I_1 + I_2.

For any ε > 0, I_1 ⩽ pr{|m̂(X_o) − m(X_o)| > ε} + ε and I_2 ⩽ ε + pr{|m̂(X_o) − m(X_o)| ⩾ ε}. The proof is completed by showing that I_1 → 0 and I_2 → 0.

Supplementary material

Supplementary material is available at Biometrika online.

References

  1. Altun Y, Smola A. Unifying divergence minimization and statistical inference via convex duality. In: Lugosi G, Simon HU, editors. Learning Theory: 19th Ann Conf Learn Theory. Berlin: Springer; 2006. pp. 139–53.
  2. Bartlett MS. Approximate confidence intervals. Biometrika. 1953;40:12–19.
  3. Bickel P, Li B. Regularization in statistics (with discussion). Test. 2006;15:271–344.
  4. Bregman LM. A relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming. USSR Comp Math Math Phys. 1967;7:620–31.
  5. Chow YS, Teicher H. Probability Theory. 2nd ed. New York: Springer; 1988.
  6. Efron B. How biased is the apparent error rate of a prediction rule? J Am Statist Assoc. 1986;81:461–70.
  7. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Ann Statist. 2004;32:407–99.
  8. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Statist Assoc. 2001;96:1348–60.
  9. Fan J, Peng H. Nonconcave penalized likelihood with a diverging number of parameters. Ann Statist. 2004;32:928–61.
  10. Golub GH, Van Loan CF. Matrix Computations. 3rd ed. Baltimore, MD: Johns Hopkins University Press; 1996.
  11. Grünwald PD, Dawid AP. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. Ann Statist. 2004;32:1367–433.
  12. Güvenir HA, Acar B, Demiröz G, Çekin A. A supervised machine learning algorithm for arrhythmia analysis. Comp Cardiol. 1997;24:433–6.
  13. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. New York: Springer; 2001.
  14. Huang J, Ma SG, Zhang CH. Adaptive lasso for sparse high-dimensional regression models. Statist Sinica. 2008;18:1603–18.
  15. Kivinen J, Warmuth MK. Boosting as entropy projection. In: Proc 12th Ann Conf Comp Learn Theory. New York: ACM Press; 1999. pp. 134–44.
  16. Knight K, Fu WJ. Asymptotics for lasso-type estimators. Ann Statist. 2000;28:1356–78.
  17. Lafferty JD, Della Pietra S, Della Pietra V. Statistical learning algorithms based on Bregman distances. In: Proc 5th Can Workshop Info Theory; 1997.
  18. Lafferty J. Additive models, boosting, and inference for generalized divergences. In: Proc 12th Ann Conf Comp Learn Theory. New York: ACM Press; 1999. pp. 125–33.
  19. McCullagh P. Quasi-likelihood functions. Ann Statist. 1983;11:59–67.
  20. Meinshausen N, Buhlmann P. High dimensional graphs and variable selection with the lasso. Ann Statist. 2006;34:1436–62.
  21. Rosset S, Zhu J. Piecewise linear regularized solution paths. Ann Statist. 2007;35:1012–30.
  22. Shen X, Tseng GC, Zhang X, Wong WH. On ψ-learning. J Am Statist Assoc. 2003;98:724–34.
  23. Strimmer K. Modeling gene expression measurement error: a quasi-likelihood approach. BMC Bioinformatics. 2003;4:10. doi: 10.1186/1471-2105-4-10.
  24. Tibshirani R. Regression shrinkage and selection via the lasso. J R Statist Soc B. 1996;58:267–88.
  25. Vapnik V. The Nature of Statistical Learning Theory. New York: Springer; 1996.
  26. van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes: With Applications to Statistics. New York: Springer; 1996.
  27. Wedderburn RWM. Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika. 1974;61:439–47.
  28. Zhang CM, Jiang Y, Shang Z. New aspects of Bregman divergence in regression and classification with parametric and nonparametric estimation. Can J Statist. 2009;37:119–39.
  29. Zhang CM, Zhang ZJ. Regularized estimation of hemodynamic response function for fMRI data. Statist Interface. 2010;3:15–32.
  30. Zhao P, Yu B. On model selection consistency of lasso. J Mach Learn Res. 2006;7:2541–67.
  31. Zou H. The adaptive lasso and its oracle properties. J Am Statist Assoc. 2006;101:1418–29.


