Biometrika. 2010 Jun 30;97(3):551–566. doi: 10.1093/biomet/asq033

Penalized Bregman divergence for large-dimensional regression and classification

Chunming Zhang 1, Yuan Jiang 1, Yi Chai 1

Summary

Regularization methods are characterized by loss functions measuring data fits and penalty terms constraining model parameters. The commonly used quadratic loss is not suitable for classification with binary responses, whereas the loglikelihood function is not readily applicable to models where the exact distribution of observations is unknown or not fully specified. We introduce the penalized Bregman divergence by replacing the negative loglikelihood in the conventional penalized likelihood with Bregman divergence, which encompasses many commonly used loss functions in the regression analysis, classification procedures and machine learning literature. We investigate new statistical properties of the resulting class of estimators with the number pn of parameters either diverging with the sample size n or even nearly comparable with n, and develop statistical inference tools. It is shown that the resulting penalized estimator, combined with appropriate penalties, achieves the same oracle property as the penalized likelihood estimator, but asymptotically does not rely on the complete specification of the underlying distribution. Furthermore, the choice of loss function in the penalized classifiers has an asymptotically relatively negligible impact on classification performance. We illustrate the proposed method for quasilikelihood regression and binary classification with simulation evaluation and real-data application.

Some key words: Consistency, Divergence minimization, Exponential family, Loss function, Optimal Bayes rule, Oracle property, Quasilikelihood

1. Introduction

Regularization is used to obtain well-behaved solutions to overparameterized estimation problems, and is particularly appealing in high dimensions. The topic is reviewed by Bickel & Li (2006). Regularization estimates a vector parameter of interest β ∈ 𝕉pn by minimizing the criterion function,

ℓ_n(β) = L_n(β) + P_{λ_n}(β)  (λ_n > 0),

consisting of a data fit functional Ln, which measures how well β fits the observed set of data; a penalty functional Pλn, which assesses the physical plausibility of β; and a regularization parameter λn, which regulates the penalty. Depending on the nature of the output variable, the term Ln quantifies the error of an estimator by different error measures. For example, the quadratic loss function has nice analytical properties and is usually used in regression analysis. However, it is not always adequate in classification problems, where the misclassification loss, deviance loss, hinge loss for the support vector machine (Vapnik, 1996) and exponential loss for boosting (Hastie et al., 2001) are more realistic and commonly used in classification procedures.

Currently, most research on regularization methods is devoted to variants of penalty methods in conjunction with linear models and likelihood-based models in regression analysis. For linear model estimation with a fixed number p of parameters, Tibshirani (1996) introduced the L1-penalty for the proposed lasso method, where the quadratic loss is used. Theoretical properties of the lasso have been intensively studied; see Knight & Fu (2000), Meinshausen & Buhlmann (2006) and Zhao & Yu (2006). Zou (2006) showed that the lasso is in general not variable-selection consistent, whereas the adaptive lasso, which combines appropriately weighted L1-penalties, is consistent. Huang et al. (2008) extended the results of Zou (2006) to high-dimensional linear models. Using the smoothly clipped absolute deviation penalty, Fan & Li (2001) showed that the penalized likelihood estimator achieves the oracle property: the resulting estimator is asymptotically as efficient as the oracle estimator. In their treatment, the number of model parameters is fixed at p, and the loss function equals the negative loglikelihood. Fan & Peng (2004) extended the result to p_n diverging with n at a certain rate.

On the loss side, the literature on penalization methods includes much less discussion of either the role of the loss function in regularization for models other than linear or likelihood-based models, or the impact of different loss functions on classification performance. The least angle regression algorithm (Efron et al., 2004) for L1-penalization was developed for linear models using the quadratic loss. Rosset & Zhu (2007) studied the piecewise linear regularized solution paths for differentiable and piecewise quadratic loss functions with L1 penalty. It remains desirable to explore whether penalization methods using other types of loss functions can potentially benefit from the efficient least-angle regression algorithm. Moreover, theoretical results on the penalized likelihood are not readily translated into results for approaches, such as quasilikelihood (Wedderburn, 1974; McCullagh, 1983; Strimmer, 2003), where the distribution of the observations is unknown or not fully specified. Accordingly, a discussion of statistical inference for penalized estimation using a wider range of loss functions is needed.

In this study, we broaden the scope of penalization by incorporating loss functions belonging to the Bregman divergence class which unifies many commonly used loss functions. In particular, the quasilikelihood function and all loss functions mentioned previously in classification fall into this class. We introduce the penalized Bregman divergence by replacing the quadratic loss or the negative loglikelihood in penalized least-squares or penalized likelihood with Bregman divergence, and call the resulting estimator a penalized Bregman divergence estimator. Nonetheless, the Bregman divergence in general does not fulfill assumptions specifically imposed on the likelihood function associated with penalized likelihood.

We investigate new statistical properties of large-dimensional penalized Bregman divergence estimators, with dimensions dealt with separately in two cases:

Case I: p_n is diverging with n;  (1)
Case II: p_n is nearly comparable with n.  (2)

Zhang & Zhang (2010) give an application of the penalization method developed in this paper to estimating the hemodynamic response function for brain fMRI data where pn is as large as n. The current paper shows that the penalized Bregman divergence estimator, combined with appropriate penalties, achieves the same oracle property as the penalized likelihood estimator, but the asymptotic distribution does not rely on the complete specification of the underlying distribution. From the classification viewpoint, our study elucidates the applicability and consistency of various classifiers induced by penalized Bregman divergence estimators. Technical details of this paper are in the online Supplementary Material.

2. The penalized Bregman divergence estimator

2.1. Bregman divergence

We give a brief overview of Bregman divergence. For a given concave function q with derivative q′, Bregman (1967) introduced a device for constructing a bivariate function,

Q(ν, μ) = q(μ) − q(ν) + (ν − μ) q′(μ).  (3)

Figure 1 displays Q and the corresponding q. It is readily seen that the concavity of q ensures the nonnegativity of Q. Moreover, for a strictly concave q, Q(ν, μ) = 0 is equivalent to ν = μ. However, since Q(ν, μ) is not generally symmetric in ν and μ, Q is not a metric or distance in the strict sense. Hence, we call Q the Bregman divergence and call q the generating function of Q. See Efron (1986), Lafferty et al. (1997), Lafferty (1999), Kivinen & Warmuth (1999), Grünwald & Dawid (2004), Altun & Smola (2006) and references therein.
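
In more detail (a standard one-line argument, spelled out here for completeness): concavity of q gives the tangent-line bound q(ν) ⩽ q(μ) + (ν − μ)q′(μ) for all ν and μ, so that

Q(ν, μ) = q(μ) − q(ν) + (ν − μ)q′(μ) ⩾ 0,

with equality only at ν = μ when q is strictly concave.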

Fig. 1. Illustration of Q(ν, μ) as defined in (3). The concave curve is q; the two dashed lines indicate the locations of ν and μ; the solid straight line is q(μ) + (ν − μ)q′(μ); the length of the vertical line with arrows at each end is Q(ν, μ).

The Bregman divergence is suitable for a broad array of error measures Q. For example, q(μ) = aμ − μ² with some constant a yields the quadratic loss Q(Y, μ) = (Y − μ)². For a binary response variable Y, q(μ) = min{μ, (1 − μ)} gives the misclassification loss Q(Y, μ) = I{Y ≠ I(μ > 1/2)}, where I(·) denotes the indicator function; q(μ) = −{μ log(μ) + (1 − μ) log(1 − μ)} gives the Bernoulli deviance loss Q(Y, μ) = −{Y log(μ) + (1 − Y) log(1 − μ)}; q(μ) = 2 min{μ, (1 − μ)} results in the hinge loss; and q(μ) = 2{μ(1 − μ)}^{1/2} yields the exponential loss Q(Y, μ) = exp[−(Y − 0.5) log{μ/(1 − μ)}].
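
As a small numerical illustration (a minimal Python sketch of ours, not part of the paper's numerical work; the helper names are illustrative), the construction (3) can be coded once and specialized to two of the losses above:

    import math

    def bregman(q, dq):
        # Q(nu, mu) = q(mu) - q(nu) + (nu - mu) * q'(mu), as in (3)
        return lambda nu, mu: q(mu) - q(nu) + (nu - mu) * dq(mu)

    # quadratic loss: q(mu) = a*mu - mu^2; the affine part a*mu cancels in (3), so take a = 0
    Q_quad = bregman(lambda mu: -mu ** 2, lambda mu: -2.0 * mu)

    # Bernoulli deviance: q(mu) = -{mu log(mu) + (1 - mu) log(1 - mu)}
    def xlogx(t):
        return 0.0 if t == 0 else t * math.log(t)
    Q_dev = bregman(lambda mu: -(xlogx(mu) + xlogx(1.0 - mu)),
                    lambda mu: -math.log(mu / (1.0 - mu)))

    y, mu = 1.0, 0.8
    assert abs(Q_quad(y, mu) - (y - mu) ** 2) < 1e-12
    assert abs(Q_dev(y, mu) - (-(y * math.log(mu) + (1 - y) * math.log(1 - mu)))) < 1e-12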

Conversely, for a given Q, Zhang et al. (2009) provided necessary and sufficient conditions for Q being a Bregman divergence, and in that case derived an explicit formula for q. Applying this inverse approach from Q to q, they illustrated that the quasilikelihood function, the Kullback–Leibler divergence or the deviance loss for the exponential family of probability functions, and many margin-based loss functions (Shen et al., 2003) are Bregman divergences. To our knowledge, there is little theoretical work in the literature on thoroughly examining the penalized Bregman divergence, via methods of regularization, for large-dimensional model building, variable selection and classification problems.

2.2. The model and penalized Bregman divergence estimator

Let (X, Y) denote a random realization from some underlying population, where X = (X1, …, Xpn)T is the input vector and Y is the output variable. The dimension pn follows the assumption in (1) or (2). We assume the parametric model,

m(x) = E(Y | X = x) = F^{-1}(b_{0;0} + x^T β_0),  (4)

where F is a known link function, b_{0;0} ∈ 𝕉^1 and β_0 = (β_{1;0}, …, β_{p_n;0})^T ∈ 𝕉^{p_n} are the unknown true parameters. Throughout the paper, it is assumed that some entries in β_0 are exactly zero. Write β_0 = {β_0^{(I)T}, β_0^{(II)T}}^T, where β_0^{(I)} collects all nonzero coefficients and β_0^{(II)} = 0.

Our goal is to estimate the true parameters via penalization. Let {(X_1, Y_1), …, (X_n, Y_n)} be a sample of independent random pairs from (X, Y), where X_i = (X_{i1}, …, X_{ip_n})^T. The penalized Bregman divergence estimator (b̂_0, β̂) is defined as the minimizer of the criterion function,

ℓ_n(b_0, β) = (1/n) Σ_{i=1}^n Q{Y_i, F^{-1}(b_0 + X_i^T β)} + Σ_{j=1}^{p_n} P_{λ_n}(|β_j|),  (5)

where β = (β_1, …, β_{p_n})^T, the loss function Q(·, ·) is a Bregman divergence, and P_{λ_n}(·) represents a nonnegative penalty function indexed by a tuning constant λ_n > 0. Set β̃ = (b_0, β^T)^T and, correspondingly, X̃_i = (1, X_i^T)^T. Then (5) can be written as

ℓ_n(β̃) = (1/n) Σ_{i=1}^n Q{Y_i, F^{-1}(X̃_i^T β̃)} + Σ_{j=1}^{p_n} P_{λ_n}(|β_j|).  (6)

The penalized Bregman divergence estimator is β̃_E = (b̂_0, β̂_1, …, β̂_{p_n})^T = arg min_{β̃} ℓ_n(β̃).

Regarding the uniqueness of β̃E, assume that the quantities

q_j(y; θ) = (∂^j/∂θ^j) Q{y, F^{-1}(θ)}  (j = 0, 1, …),  (7)

exist finitely up to any order required. Provided that for all θ ∈ 𝕉 and all y in the range of Y,

q_2(y; θ) > 0,  (8)

it follows that L_n(β̃) = n^{-1} Σ_{i=1}^n Q{Y_i, F^{-1}(X̃_i^T β̃)} in (6) is convex in β̃. In that case, if convex penalties are used in (6), then ℓ_n(β̃) is necessarily convex in β̃, and hence the local minimizer β̃_E is the unique global penalized Bregman divergence estimator. For nonconvex penalties, however, the local minimizer may not be globally unique.
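
To make the estimator concrete, the following minimal sketch (ours, not the authors' implementation; the step size, iteration count and toy data are illustrative) minimizes (6) by proximal gradient descent for the Bernoulli deviance loss with the logit link and the L1-penalty, whose proximal map is soft-thresholding:

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def penalized_bd_logit(X, y, lam, step=0.1, iters=5000):
        # minimize (1/n) sum_i Q{y_i, F^{-1}(b0 + x_i' beta)} + lam * sum_j |beta_j|,
        # with Q the deviance loss, F the logit link and the intercept b0 unpenalized
        n, p = X.shape
        Xt = np.column_stack([np.ones(n), X])
        beta = np.zeros(p + 1)
        for _ in range(iters):
            mu = sigmoid(Xt @ beta)                        # current fitted means
            beta = beta - step * (Xt.T @ (mu - y) / n)     # gradient step on the Q-loss part
            # soft-thresholding: proximal map of the L1-penalty (intercept skipped)
            beta[1:] = np.sign(beta[1:]) * np.maximum(np.abs(beta[1:]) - step * lam, 0.0)
        return beta

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    truth = np.concatenate([[1.5, -2.0], np.zeros(8)])
    y = rng.binomial(1, sigmoid(0.5 + X @ truth))
    print(penalized_bd_logit(X, y, lam=0.05))

For nonconvex penalties, the same scheme can be run with the soft-thresholding step replaced by the corresponding proximal update, but, as noted above, only a local minimizer is then guaranteed.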

3. Penalized Bregman divergence with nonconvex penalties: p_n ≪ n

3.1. Consistency

We start by introducing some notation. Let s_n denote the number of nonzero coordinates of β_0, and set β̃_0 = (b_{0;0}, β_0^T)^T. Define

a_n = max_{j=1,…,s_n} |P′_{λ_n}(|β_{j;0}|)|,  b_n = max_{j=1,…,s_n} |P″_{λ_n}(|β_{j;0}|)|,

where P_λ^{(j)}(|β|) is shorthand for (d^j/dx^j) P_λ(x)|_{x=|β|}, j = 1, 2. Unless otherwise stated, ‖·‖ denotes the L_2-norm. Theorem 1 guarantees the existence of a consistent local minimizer of (6), and states that the local penalized Bregman divergence estimator β̃_E is (n/p_n)^{1/2}-consistent.

Theorem 1 (Existence and consistency). Assume Condition A in the Appendix, a_n = O(1/n^{1/2}) and b_n = o(1). If p_n^4/n → 0, (p_n/n)^{1/2}/λ_n → 0 and min_{j=1,…,s_n} |β_{j;0}|/λ_n → ∞ as n → ∞, then there exists a local minimizer β̃_E of (6) such that ‖β̃_E − β̃_0‖ = O_P{(p_n/n)^{1/2}}.

3.2. Oracle property

Following Theorem 1, the oracle property of the local minimizer is given in Theorem 2 below. Before stating it, we need some notation. Write X = (X^{(I)T}, X^{(II)T})^T, X̃^{(I)} = (1, X^{(I)T})^T and β̃^{(I)} = (b_0, β^{(I)T})^T. For the penalty term, let

d_n = {0, P′_{λ_n}(|β_{1;0}|) sign(β_{1;0}), …, P′_{λ_n}(|β_{s_n;0}|) sign(β_{s_n;0})}^T,  Σ_n = diag{0, P″_{λ_n}(|β_{1;0}|), …, P″_{λ_n}(|β_{s_n;0}|)}.

For the q function, define F_n = q^{(2)}{m(X)}/[F^{(1)}{m(X)}]^2 X̃^{(I)} X̃^{(I)T} and

Ω_n = E[var(Y | X) q^{(2)}{m(X)} F_n],  H_n = −E(F_n).

Theorem 2 (Oracle property). Assume Condition B in the Appendix.

  1. If p_n^2/n = O(1), p_n/(n^{1/2}λ_n) → 0 and lim inf_{n→∞} lim inf_{θ→0+} P′_{λ_n}(θ)/λ_n > 0 as n → ∞, then any (n/p_n)^{1/2}-consistent local minimizer β̃_E = {β̃_E^{(I)T}, β̂^{(II)T}}^T satisfies pr(β̂^{(II)} = 0) → 1.

  2. Moreover, if a_n = O(1/n^{1/2}), p_n^5/n → 0 and min_{j=1,…,s_n} |β_{j;0}|/λ_n → ∞, then for any fixed integer k and any k × (s_n + 1) matrix A_n such that A_n A_n^T → G with G a k × k nonnegative-definite symmetric matrix, n^{1/2} A_n Ω_n^{-1/2} {(H_n + Σ_n)(β̃_E^{(I)} − β̃_0^{(I)}) + d_n} → N(0, G) in distribution.

Theorem 2 has some useful consequences: First, the pn-dimensional penalized Bregman divergence estimator, combined with appropriate penalties, achieves the same oracle property as the penalized likelihood estimator of Fan & Peng (2004): the estimators of the zero parameters take exactly zero values with probability tending to 1, and the estimators of the nonzero parameters are asymptotically normal with the same means and variances as if the zero coefficients were known in advance. Second, the asymptotic distribution of the penalized Bregman divergence estimator relies on the underlying distribution of Y | X through E(Y | X) and var(Y | X), but does not require a complete specification of the underlying distribution. Third, the asymptotic distribution depends on the choice of the Q-loss only through the second derivative of its generating q function. This enables us to evaluate the impact of loss functions on the penalized Bregman divergence estimators and to derive an optimal loss function in certain situations.

According to Theorem 2, the asymptotic covariance matrix of β̃_E^{(I)} is V_n = (H_n + Σ_n)^{-1} Ω_n (H_n + Σ_n)^{-1}. In practice, V_n is unknown and needs to be estimated. Typically, the sandwich formula can be exploited to form an estimator of V_n,

V̂_n = (Ĥ_n + Σ̂_n)^{-1} Ω̂_n (Ĥ_n + Σ̂_n)^{-1},  (9)

where Ω̂_n = n^{-1} Σ_{i=1}^n q_1²(Y_i; X̃_i^{(I)T} β̃_E^{(I)}) X̃_i^{(I)} X̃_i^{(I)T}, Ĥ_n = n^{-1} Σ_{i=1}^n q_2(Y_i; X̃_i^{(I)T} β̃_E^{(I)}) X̃_i^{(I)} X̃_i^{(I)T} and Σ̂_n = diag{0, P″_{λ_n}(|β̂_1|), …, P″_{λ_n}(|β̂_{s_n}|)}.
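
For instance, a minimal sketch of (9) for the Bernoulli deviance loss with the logit link, for which q_1(y; θ) = μ − y and q_2(y; θ) = μ(1 − μ) with μ = F^{-1}(θ), could read as follows (our illustration; X_sel holds the selected covariates, beta_hat stacks (b̂_0, β̂^{(I)}) and d2pen holds the values P″_{λ_n}(|β̂_j|)):

    import numpy as np

    def sandwich_cov(X_sel, y, beta_hat, d2pen):
        n = X_sel.shape[0]
        Xt = np.column_stack([np.ones(n), X_sel])        # tilde X^{(I)}
        mu = 1.0 / (1.0 + np.exp(-(Xt @ beta_hat)))      # fitted means
        q1 = mu - y                                      # q_1 at the fitted values
        q2 = mu * (1.0 - mu)                             # q_2 at the fitted values
        Omega = (Xt * (q1 ** 2)[:, None]).T @ Xt / n     # hat Omega_n
        H = (Xt * q2[:, None]).T @ Xt / n                # hat H_n
        Sigma = np.diag(np.concatenate([[0.0], d2pen]))  # hat Sigma_n
        HS_inv = np.linalg.inv(H + Sigma)
        return HS_inv @ Omega @ HS_inv                   # hat V_n in (9)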

Proposition 1 below demonstrates that for any (n/p_n)^{1/2}-consistent estimator β̃_E^{(I)} of β̃_0^{(I)}, V̂_n is a consistent estimator of the covariance matrix V_n, in the sense that A_n(V̂_n − V_n)A_n^T → 0 in probability for any k × (s_n + 1) matrix A_n satisfying A_nA_n^T → G, where k is any fixed integer.

Proposition 1 (Covariance matrix estimation). Assume Condition B in the Appendix, and b_n = o(1). If p_n^4/n → 0, (p_n/n)^{1/2}/λ_n → 0 and min_{j=1,…,s_n} |β_{j;0}|/λ_n → ∞ as n → ∞, then for any β̃_E^{(I)} with ‖β̃_E^{(I)} − β̃_0^{(I)}‖ = O_P{(p_n/n)^{1/2}}, we have that A_n(V̂_n − V_n)A_n^T → 0 in probability for any k × (s_n + 1) matrix A_n satisfying A_nA_n^T → G, where G is a k × k matrix.

Is there an optimal choice of q such that the corresponding V_n matrix achieves its lower bound? We have that V_n = H_n^{-1} Ω_n H_n^{-1} in two special cases. One is Σ_n = 0 for large n and large min_{j=1,…,s_n} |β_{j;0}|, which results from the smoothly clipped absolute deviation and hard thresholding penalties; the other is Σ_n = 0 for all n, which results from the weighted L1-penalties in Theorem 6 below. In these cases, it can be shown via matrix algebra that the optimal q satisfies the generalized Bartlett identity (11) below. On the other hand, for an arbitrary Σ_n ≠ 0, complications arise and the optimal q is generally not available in closed form.

3.3. Hypothesis testing

We consider hypothesis testing about β̃_0^{(I)}, formulated as

H_0: A_n β̃_0^{(I)} = 0  versus  H_1: A_n β̃_0^{(I)} ≠ 0,  (10)

where A_n is a given k × (s_n + 1) matrix such that A_nA_n^T = G with G a k × k positive-definite matrix. This form of linear hypothesis allows one to test simultaneously whether a subset of the variables is statistically significant, by taking a specific form of the matrix A_n; for example, A_n = [I_k, 0_{k×(s_n+1−k)}] yields A_nA_n^T = I_k.

We propose a generalized Wald-type test statistic of the form

W_n = n (A_n β̃_E^{(I)})^T (A_n Ĥ_n^{-1} Ω̂_n Ĥ_n^{-1} A_n^T)^{-1} (A_n β̃_E^{(I)}),

where Ω̂_n and Ĥ_n are as defined in (9). The test is asymptotically distribution-free, as Theorem 3 shows that, under the null hypothesis, W_n is asymptotically distributed as χ²_k.
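
In practice W_n is easy to compute; a minimal sketch (ours; the inputs are assumed to be the fitted quantities from (9) and a user-supplied contrast matrix A with k rows) is:

    import numpy as np
    from scipy.stats import chi2

    def wald_test(n, beta_I_hat, Omega_hat, H_hat, A):
        # W_n = n (A b)' (A H^{-1} Omega H^{-1} A')^{-1} (A b); p-value from chi^2_k
        Hinv = np.linalg.inv(H_hat)
        middle = A @ Hinv @ Omega_hat @ Hinv @ A.T
        Ab = A @ beta_I_hat
        W = float(n * Ab @ np.linalg.solve(middle, Ab))
        return W, chi2.sf(W, df=A.shape[0])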

Theorem 3 (Wald-type test under H_0). Assume Condition C in the Appendix, a_n = o{1/(ns_n)^{1/2}} and b_n = o(1/p_n^{1/2}). If p_n^5/n → 0, (p_n/n)^{1/2}/λ_n → 0 and min_{j=1,…,s_n} |β_{j;0}|/λ_n → ∞ as n → ∞, then under H_0 in (10), W_n → χ²_k in distribution.

Remark 1. To appreciate the discriminating power of W_n in assessing significance, its asymptotic power can be analysed. It can be shown that under H_1 in (10), with ‖A_nβ̃_0^{(I)}‖ independent of n, W_n → +∞ in probability at the rate n; hence the power of W_n tends to 1 against fixed alternatives. Besides, W_n has nontrivial local power for detecting contiguous alternatives approaching the null at the rate n^{-1/2}. We omit the lengthy details.

In the context of the penalized likelihood estimator β̃_E, Fan & Peng (2004) showed that the likelihood-ratio-type test statistic

Λ_n = 2n { min_{β̃ ∈ 𝕉^{p_n+1}: A_n β̃^{(I)} = 0} ℓ_n(β̃) − ℓ_n(β̃_E) }

follows an asymptotic χ² distribution under the null hypothesis. Theorem 4 below explores the extent to which this result can be extended to Λ_n constructed from the broad class of penalized Bregman divergence estimators.

Theorem 4 (Likelihood-ratio-type test under H_0). Assume (8) and Condition D in the Appendix, a_n = o{1/(ns_n)^{1/2}} and b_n = o(1/p_n^{1/2}). If p_n^5/n → 0, (p_n/n)^{1/2}/λ_n → 0 and min_{j=1,…,s_n} |β_{j;0}|/λ_n → ∞ as n → ∞, then under H_0 in (10), provided that q satisfies the generalized Bartlett identity,

q^{(2)}{m(·)} = −c/var(Y | X = ·),  (11)

for a constant c > 0, we have that Λ_n/c → χ²_k in distribution.

Curiously, the result in Theorem 4 indicates that in general, condition (11) on q restricts the application domain of the test statistic Λn. For instance, in the case of binary responses, the Bernoulli deviance loss satisfies (11), but the quadratic loss and exponential loss violate (11). This limitation reflects that the likelihood-ratio-type test statistic Λn may not be straightforwardly valid for the penalized Bregman divergence estimators.

Remark 2. For a Bregman divergence Q, condition (11) with c = 1 is equivalent to the equality E[∂²Q{Y, m(·)}/∂m(·)² | X = ·] = E([∂Q{Y, m(·)}/∂m(·)]² | X = ·), which includes the Bartlett identity (Bartlett, 1953) as a special case when Q is the negative loglikelihood. Thus we call (11) the generalized Bartlett identity. It is also seen that the quadratic loss satisfies (11) for homoscedastic regression models, even without knowledge of the error distribution.
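
These assertions can be checked directly from the generating functions listed in § 2.1. For the Bernoulli deviance, q(μ) = −{μ log(μ) + (1 − μ) log(1 − μ)} gives

q^{(2)}(μ) = −1/{μ(1 − μ)} = −1/var(Y | X),

so (11) holds with c = 1. For the quadratic loss with homoscedastic errors, var(Y | X) = σ², we have q^{(2)}(μ) = −2 = −c/σ² with c = 2σ². For the exponential loss, q(μ) = 2{μ(1 − μ)}^{1/2} gives q^{(2)}(μ) = −(1/2){μ(1 − μ)}^{-3/2}, which is not of the form −c/{μ(1 − μ)}, so (11) fails for binary responses.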

4. Penalized Bregman divergence with convex penalties: p_n ≈ n

4.1. Consistency, oracle property and hypothesis testing

For the nonconvex penalties discussed in § 3, the condition p_n^4/n → 0 or p_n^5/n → 0 can be relaxed to p_n^3/n → 0 in the particular situation where the Bregman divergence is the quadratic loss and the link is the identity link. It remains unclear whether the condition on p_n can be relaxed in other cases.

This section aims to improve the rate of consistency of the penalized Bregman divergence estimators and to relax the conditions on p_n using certain convex penalties, the weighted L1-penalties, under which the penalized Bregman divergence estimator β̃_E = (b̂_0, β̂^T)^T is defined to minimize the criterion function,

ℓ_n(β̃) = (1/n) Σ_{i=1}^n Q{Y_i, F^{-1}(X̃_i^T β̃)} + λ_n Σ_{j=1}^{p_n} w_j |β_j|,  (12)

with w_1, …, w_{p_n} representing nonnegative weights. Define

w_max^{(I)} = max_{j=1,…,s_n} w_j,  w_min^{(II)} = min_{s_n+1 ⩽ j ⩽ p_n} w_j.

Lemma 1 establishes the existence of an (n/p_n)^{1/2}-consistent local minimizer of (12). This rate is identical to that in Theorem 1 but, unlike Theorem 1, Lemma 1 covers the L1-penalty. Other results parallel to those in § 3 can be obtained similarly.

Lemma 1 (Existence and consistency). Assume Conditions A1–A7 in the Appendix and w_max^{(I)} = O_P{1/(λ_n n^{1/2})}. If p_n^4/n → 0 as n → ∞, then there exists a local minimizer β̃_E of (12) such that ‖β̃_E − β̃_0‖ = O_P{(p_n/n)^{1/2}}.

Lemma 1 imposes a condition on the weights of the nonzero coefficients alone, and ignores the weights on the zero coefficients. Theorem 5 below shows that incorporating appropriate weights for the zero coefficients can improve the rate of consistency from (p_n/n)^{1/2} to (s_n/n)^{1/2}.

Theorem 5 (Existence and consistency). Assume Conditions A1–A7 in the Appendix, w_max^{(I)} = O_P{1/(λ_n n^{1/2})}, and that there exists a constant M ∈ (0, ∞) such that lim_{n→∞} pr(w_min^{(II)} λ_n > M) = 1. If s_n^4/n → 0 and s_n(p_n − s_n) = o(n), then there exists a local minimizer β̃_E of (12) such that ‖β̃_E − β̃_0‖ = O_P{(s_n/n)^{1/2}}.

More importantly, the conditions on the dimension p_n are much relaxed. For example, Theorem 5 allows p_n = o(n^{(3+δ)/(4+δ)}) for any δ > 0, provided that s_n = O(n^{1/(4+δ)}), whereas Theorem 1 requires p_n = o(n^{1/4}) for any s_n ⩽ p_n. This implies that p_n can indeed be relaxed to case (2), where p_n is nearly comparable with n. On the other hand, the proof of Theorem 5 relies on the flexibility of the weights {w_j}, as seen in the I_{2,1}^{(II)} term; directly carrying the proof of Theorem 5 over to either the nonconvex penalties in Theorem 1 or the L1-penalty is therefore not feasible.

Theorem 6 gives an oracle property for the (n/sn)1/2-consistent local minimizer.

Theorem 6 (Oracle property). Assume Conditions A1, A2, B3, A4, B5 and A6–A7 in the Appendix.

  1. If s_n^2/n = O(1) and w_min^{(II)} λ_n n^{1/2}/(s_n p_n)^{1/2} → ∞ in probability as n → ∞, then any (n/s_n)^{1/2}-consistent local minimizer β̃_E = {β̃_E^{(I)T}, β̂^{(II)T}}^T satisfies pr(β̂^{(II)} = 0) → 1.

  2. Moreover, if w_max^{(I)} = O_P{1/(λ_n n^{1/2})}, s_n^5/n → 0 and min_{j=1,…,s_n} |β_{j;0}|/(s_n/n)^{1/2} → ∞, then for any fixed integer k and any k × (s_n + 1) matrix A_n such that A_nA_n^T → G with G a k × k nonnegative-definite symmetric matrix, n^{1/2} A_n Ω_n^{-1/2} {H_n(β̃_E^{(I)} − β̃_0^{(I)}) + λ_n W_n sign(β̃_0^{(I)})} → N(0, G) in distribution, where W_n = diag(0, w_1, …, w_{s_n}) and sign(β̃_0^{(I)}) = {sign(b_{0;0}), sign(β_{1;0}), …, sign(β_{s_n;0})}^T.

For testing hypotheses of the form (10), the generalized Wald-type test statistic Wn proposed in § 3.3 continues to be applicable. Theorem 7 derives the asymptotic distribution of Wn.

Theorem 7 (Wald-type test under H_0). Assume Conditions A1, A2, B3, C4, B5 and A6–A7 in the Appendix, and that w_max^{(I)} = o_P[1/{λ_n(ns_n)^{1/2}}]. If s_n^5/n → 0 and min_{j=1,…,s_n} |β_{j;0}|/(s_n/n)^{1/2} → ∞ as n → ∞, then under H_0 in (10), W_n → χ²_k in distribution.

4.2. Weight selection

We propose a penalized componentwise regression method that selects the weights as

ŵ_j = |β̂_j^{PCR}|^{-1}  (j = 1, …, p_n),  (13)

based on an initial estimator β̂^{PCR} = (β̂_1^{PCR}, …, β̂_{p_n}^{PCR})^T minimizing

ℓ_n^{PCR}(β) = (1/n) Σ_{i=1}^n Σ_{j=1}^{p_n} Q{Y_i, F^{-1}(X_{ij} β_j)} + κ_n Σ_{j=1}^{p_n} |β_j|,  (14)

with some sequence κn > 0. Theorem 8 indicates that under assumptions on the correlation between the predictor variables and the response variable, the weights selected by the penalized componentwise regression satisfy the conditions in Theorem 5.

Theorem 8 (Penalized componentwise regression for weights: p_n ≈ n). Assume Conditions A1, A2, B3, A4, A6, A7 and E. Assume that in Condition E, 𝒜_n = λ_n n^{1/2}, 𝒜_n/κ_n → ∞ and 𝒝_n/κ_n = O(1) for κ_n in (14). Suppose that λ_n n^{1/2} = O(1), λ_n = o(κ_n) and log(p_n) = o(nκ_n²). Assume that E(X) = 0 in model (4). Then there exist local minimizers β̂_j^{PCR} (j = 1, …, p_n) of (14) such that the weights ŵ_j (j = 1, …, p_n) defined in (13) satisfy ŵ_max^{(I)} = O_P{1/(λ_n n^{1/2})} and ŵ_min^{(II)} λ_n → ∞ in probability, as required in Theorem 5, where ŵ_max^{(I)} = max_{j=1,…,s_n} ŵ_j and ŵ_min^{(II)} = min_{s_n+1 ⩽ j ⩽ p_n} ŵ_j.
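
As an illustration of (13)–(14) (ours, not the authors' code), the special case of the quadratic loss with the identity link gives each one-dimensional problem in (14) a closed-form soft-threshold solution; for other Q-losses the same componentwise scheme applies with a one-dimensional numerical minimization in place of the closed form:

    import numpy as np

    def pcr_weights(X, y, kappa, eps=1e-10):
        # componentwise lasso: minimize (1/n) sum_i (y_i - x_{ij} b)^2 + kappa |b| for each j
        n = X.shape[0]
        c = X.T @ y / n                          # n^{-1} sum_i X_{ij} Y_i
        a = (X ** 2).sum(axis=0) / n             # n^{-1} sum_i X_{ij}^2
        beta_pcr = np.sign(c) * np.maximum(np.abs(c) - kappa / 2.0, 0.0) / a
        # w_j = |beta_j^PCR|^{-1}; a zero estimate gives an effectively infinite weight,
        # capped here at 1/eps
        return 1.0 / np.maximum(np.abs(beta_pcr), eps)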

5. Consistency of the penalized Bregman divergence classifier

This section deals with a binary response variable Y, which takes values 0 and 1. In this case, the mean regression function m(x) in (4) becomes the class label probability, pr(Y = 1 | X = x). From the penalized Bregman divergence estimator (b̂_0, β̂^T)^T proposed in either § 3 or § 4, we can construct the penalized Bregman divergence classifier, ϕ̂(x) = I{m̂(x) > 1/2}, for a future input variable x, where m̂(x) = F^{-1}(b̂_0 + x^T β̂).
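
For the logit link, for example, the induced rule can be written in two lines (a small sketch of ours; the inputs are the penalized estimates):

    import numpy as np

    def bd_classify(X_new, b0_hat, beta_hat):
        m_hat = 1.0 / (1.0 + np.exp(-(b0_hat + X_new @ beta_hat)))  # F^{-1} is the logistic function
        return (m_hat > 0.5).astype(int)                            # phi_hat(x) = I{m_hat(x) > 1/2}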

In the classification literature, the misclassification loss of a classification rule ϕ at a sample point (x, y) is l{y, ϕ(x)} = I{y ≠ ϕ(x)}. The risk of ϕ is the expected misclassification loss, R(ϕ) = E[l{Y, ϕ(X)}] = pr{ϕ(X) ≠ Y}. The optimal Bayes rule, which minimizes the risk over ϕ, is ϕ_B(x) = I{m(x) > 1/2}. For a test sample (X_o, Y_o), an independent and identically distributed copy of the samples in the training set 𝒯_n = {(X_i, Y_i), i = 1, …, n}, the optimal Bayes risk is R(ϕ_B) = pr{ϕ_B(X_o) ≠ Y_o}, and the conditional risk of the penalized Bregman divergence classification rule ϕ̂ is R(ϕ̂) = pr{ϕ̂(X_o) ≠ Y_o | 𝒯_n}. For ϕ̂ induced by penalized Bregman divergence regression estimation with a range of loss functions combined with either the smoothly clipped absolute deviation, L1 or weighted L1-penalties, Theorem 9 establishes the classification consistency of ϕ̂.

Theorem 9 (Consistency of the penalized Bregman divergence classifier). Assume Conditions A1 and A4 in the Appendix. Suppose that ‖β̃_E − β̃_0‖ = O_P(r_n). If r_n p_n^{1/2} = o(1), then the classification rule ϕ̂ constructed from β̃_E is consistent, in the sense that E{R(ϕ̂)} − R(ϕ_B) → 0 as n → ∞.

6. Simulation study

6.1. Set-up

For illustrative purposes, four procedures for penalized estimation are compared: (I) the smoothly clipped absolute deviation penalty, with accompanying parameter a = 3.7, combined with the local linear approximation; (II) the L1-penalty; (III) the weighted L1-penalties with weights selected by (13); and (IV) the oracle estimator using the set of significant variables. Throughout the numerical work in the paper, methods (I)–(III) use the least angle regression algorithm, and F is the log link for count data and the logit link for binary response variables.

6.2. Penalized quasilikelihood for overdispersed count data

A quasilikelihood function Q relaxes the distributional assumption on a random variable Y via the specification ∂Q(Y, μ)/∂μ = −(Y − μ)/V(μ), where var(Y | X = x) = V{m(x)} for a known continuous function V(·) > 0. Zhang et al. (2009) verified that the quasilikelihood function belongs to the Bregman divergence class and derived the generating q function,

q(μ) = ∫^μ {(s − μ)/V(s)} ds.  (15)
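
As a worked special case (with the lower limit of integration left unspecified, since affine terms in q do not affect Q), the Poisson-type variance function V(s) = s gives

q(μ) = ∫^μ (s − μ)/s ds = μ − μ log(μ) + affine terms,  q^{(2)}(μ) = −1/μ = −1/V(μ),

and substituting into (3), with the affine terms cancelling, yields Q(y, μ) = y log(y/μ) − (y − μ), one half of the Poisson deviance.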

We generate overdispersed Poisson counts Y_i satisfying var(Y_i | X_i = x_i) = 2m(x_i). In the predictor X_i = (X_{i1}, …, X_{ip_n})^T, p_n = n/8, n/2 and n − 10, and X_{i1} = i/n − 0.5. For j = 2, …, p_n, X_{ij} = Φ(Z_{ij}) − 0.5, where Φ is the standard normal distribution function and (Z_{i2}, …, Z_{ip_n})^T ∼ N{0, ρ 1_{p_n−1} 1_{p_n−1}^T + (1 − ρ) I_{p_n−1}}, with 1_d a d × 1 vector of ones and I_d a d × d identity matrix. Thus (X_{i2}, …, X_{ip_n}) are correlated, shifted Un(0, 1) variables if ρ ≠ 0. The link function is the log link, log{m(x)} = b_{0;0} + x^T β_0, where b_{0;0} = 5 and β_0 = (2, 2, 0, 0, …, 0)^T.
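
A sketch of this design in Python is given below (ours; the covariate construction follows the description above, whereas the overdispersed counts are drawn here from a negative binomial parameterization with variance twice the mean, which is one convenient choice and not necessarily the authors'):

    import numpy as np
    from scipy.stats import norm

    def simulate_overdispersed(n, rho=0.2, seed=0):
        rng = np.random.default_rng(seed)
        pn = n // 8                                    # e.g. p_n = n/8
        X = np.empty((n, pn))
        X[:, 0] = np.arange(1, n + 1) / n - 0.5        # X_{i1} = i/n - 0.5
        cov = rho + (1.0 - rho) * np.eye(pn - 1)       # rho 11' + (1 - rho) I
        Z = rng.multivariate_normal(np.zeros(pn - 1), cov, size=n)
        X[:, 1:] = norm.cdf(Z) - 0.5                   # X_{ij} = Phi(Z_{ij}) - 0.5
        beta0 = np.zeros(pn)
        beta0[:2] = 2.0                                # beta_0 = (2, 2, 0, ..., 0)'
        m = np.exp(5.0 + X @ beta0)                    # log link with b_{0;0} = 5
        y = rng.negative_binomial(m, 0.5)              # mean m, variance 2m
        return X, y, m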

First, to examine the effect of penalized regression estimation on model fitting, we generate 200 training sets of size n. For each training set, the model error is calculated as Σ_{l=1}^L {m̂(x_l) − m(x_l)}²/L at a randomly generated sequence {x_l}_{l=1}^L with L = 5000, and the relative model error is the ratio of the model error of a penalized estimator to that of the nonpenalized estimator. The tuning constant λ_n for the training set in each simulation for methods (I)–(II) is selected by minimizing the quasilikelihood loss on a test set of the same size as the training set; λ_n and κ_n for method (III) are searched over a two-dimensional grid of points. The mean relative model error is then obtained from the 200 training sets. Table 1 summarizes the penalized quasilikelihood estimates of the parameters based on (15). It is clearly seen that when the true model coefficients are sparse, the penalized estimators reduce the function estimation error relative to the nonpenalized estimators.

Table 1.

Simulation results from penalized quasilikelihood estimates, with dependent predictors. n = 200, ρ = 0.2

Loss             p_n     Method       MRME     CZ (sd)          IZ (sd)
Quasilikelihood  n/8     SCAD         0.2428   17.74 (5.46)     0 (0)
                         L1           0.3503   14.21 (4.91)     0 (0)
                         Weighted L1  0.1077   21.32 (2.48)     0 (0)
                         Oracle       0.0861   23
Quasilikelihood  n/2     SCAD         0.0409   91.73 (12.56)    0 (0)
                         L1           0.0712   88.00 (14.94)    0 (0)
                         Weighted L1  0.0161   94.84 (5.89)     0 (0)
                         Oracle       0.0105   98
Quasilikelihood  n − 10  SCAD         0.0010   184.37 (8.52)    0 (0)
                         L1           0.0019   181.13 (13.87)   0 (0)
                         Weighted L1  0.0004   185.25 (4.97)    0 (0)
                         Oracle       0.0002   188

SCAD, smoothly clipped absolute deviation; MRME, mean of relative model errors obtained from the training sets; CZ, average number of coefficients that are correctly estimated to be zero when the true coefficients are zero; IZ, average number of coefficients that are incorrectly estimated to be zero when the true coefficients are nonzero; sd, standard deviation.

Second, to study the utility of the penalized estimators for variable selection under quasilikelihood, Table 1 also gives the average number of coefficients correctly estimated to be zero when the true coefficients are zero, and the average number of coefficients incorrectly estimated to be zero when the true coefficients are nonzero, with the corresponding standard deviations across the 200 training sets given in brackets. Overall, the penalized estimators help to yield a sparse solution and build a sparse model. These results lend support to the theoretical results in § 3 and § 4.

In summary, the smoothly clipped absolute deviation and weighted L1 penalties outperform the L1 penalty in terms of regression estimation and variable selection. As expected, the oracle estimator, which is practically infeasible, performs better than the three penalized estimators.

6.3. Penalized Bregman divergence for binary classification

We generate two-class data from the model,

X = (X_1, …, X_{p_n})^T ∼ N(0, Σ),  Y | X = x ∼ Ber{m(x)},

where p_n = n/8, n/2, n − 10, Σ = ρ 1_{p_n} 1_{p_n}^T + (1 − ρ) I_{p_n} and logit{m(x)} = b_{0;0} + x^T β_0 with b_{0;0} = 3 and β_0 = (1.5, 2, −2, −2.5, 0, 0, …, 0)^T. Table 2 summarizes the penalized estimates of the parameters. The results reinforce the conclusions drawn in § 6.2.

Table 2.

Simulation results from penalized Bregman divergence estimates for binary classification, with dependent predictors. n = 200, ρ = 0.2

Loss         p_n     Method       MRME     CZ (sd)          IZ (sd)       MAMR
Deviance     n/8     SCAD         0.2504   18.86 (4.37)     0.01 (0.10)   0.1153
                     L1           0.3774   11.31 (5.48)     0.00 (0.00)   0.1218
                     Weighted L1  0.2409   18.11 (2.26)     0.01 (0.10)   0.1160
                     Oracle       0.1164   21               0             0.1042
Exponential  n/8     SCAD         0.2566   18.92 (4.13)     0.00 (0.00)   0.1162
                     L1           0.3356   12.28 (5.54)     0.00 (0.00)   0.1232
                     Weighted L1  0.2176   19.07 (1.66)     0.01 (0.10)   0.1175
                     Oracle       0.1276   21               0             0.1042
Deviance     n/2     SCAD         0.0612   94.74 (2.32)     0.03 (0.17)   0.1166
                     L1           0.1148   76.39 (12.97)    0.00 (0.00)   0.1313
                     Weighted L1  0.0782   89.00 (6.38)     0.04 (0.19)   0.1235
                     Oracle       0.0240   96               0             0.1043
Exponential  n/2     SCAD         0.0915   94.37 (2.91)     0.05 (0.21)   0.1209
                     L1           0.1141   76.05 (11.99)    0.00 (0.00)   0.1315
                     Weighted L1  0.0723   90.60 (4.70)     0.04 (0.19)   0.1222
                     Oracle       0.0310   96               0             0.1043
Deviance     n − 10  SCAD         0.0230   185.09 (1.53)    0.02 (0.14)   0.1136
                     L1           0.0847   158.19 (17.26)   0.00 (0.00)   0.1401
                     Weighted L1  0.0539   176.51 (8.17)    0.03 (0.17)   0.1273
                     Oracle       0.0121   186              0             0.1044
Exponential  n − 10  SCAD         0.0360   184.62 (2.20)    0.01 (0.10)   0.1170
                     L1           0.0746   161.15 (14.73)   0.00 (0.00)   0.1386
                     Weighted L1  0.0489   178.70 (5.91)    0.04 (0.19)   0.1271
                     Oracle       0.0150   186              0             0.1044

MAMR, mean of the average misclassification rates calculated from training sets.

Moreover, to investigate the performance of penalized classifiers, we evaluate the average misclassification rate for 10 independent test sets of size 10 000. Table 2 reports the mean of the average misclassification rates calculated from 100 training sets. Evidently, all penalized classifiers perform as well as the optimal Bayes classifier. This agrees with results of Theorem 9 on the asymptotic classification consistency. Furthermore, the choice of loss functions in the penalized classifiers has an asymptotically relatively negligible impact on classification performance.

7. Real data

The Arrhythmia dataset (Güvenir et al., 1997) consists of 452 patient records used in the diagnosis of cardiac arrhythmia. Each record contains 279 clinical measurements, from electrocardiography signals and other information such as sex, age and weight, along with the decision of an expert cardiologist. In the data, class 01 refers to normal electrocardiography, classes 02–15 each refer to a particular type of arrhythmia, and class 16 refers to the unclassified remainder.

We intend to predict whether or not a patient can be categorized as having normal electrocardiography. After deleting missing values and class 16, the remaining 430 patients with 257 attributes are used in the classification. To evaluate the performance of the penalized estimates of the model parameters in logit{pr(Y = 1 | X_1, …, X_{257})} = b_0 + Σ_{j=1}^{257} β_j X_j, we randomly split the data into a training set and a test set in the ratio 2:1. For each training set, the tuning constant is selected by minimizing a 3-fold crossvalidated estimate of the misclassification rate; λ_n and κ_n for the penalized componentwise regression are found on a grid of points. We calculate the mean of the misclassification rates and the average number of selected variables over 100 random splits. Table 3 shows that the penalized classifier using the deviance loss and that using the exponential loss have similar misclassification rates. In contrast, the nonpenalized classifiers retain all attributes and yield much higher misclassification rates.
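
The evaluation protocol can be mimicked with standard software; the sketch below (ours, using scikit-learn's L1-penalized logistic regression as a stand-in for the penalized deviance-loss classifier of method (II); X and y are assumed to hold the 257 attributes and the binary labels) reproduces one random split:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split

    def one_split(X, y, seed=0):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=seed)
        grid = GridSearchCV(
            LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000),
            param_grid={"C": np.logspace(-2, 2, 20)},  # C is an inverse regularization strength
            scoring="accuracy", cv=3)                  # 3-fold crossvalidation on the training set
        grid.fit(X_tr, y_tr)
        mis_rate = 1.0 - grid.score(X_te, y_te)
        n_selected = int(np.count_nonzero(grid.best_estimator_.coef_))
        return mis_rate, n_selected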

Table 3.

Arrhythmia data: mean misclassification rate and the average number of selected variables

Loss         Method        MMR      Number of selected variables
Deviance     Nonpenalized  0.4265   257.00
             SCAD          0.2550   16.13
             L1            0.2358   45.46
             Weighted L1   0.2340   26.44
Exponential  Nonpenalized  0.4323   257.00
             SCAD          0.2666   15.83
             L1            0.2397   43.79
             Weighted L1   0.2366   18.77

MMR, mean of the misclassification rates.

Acknowledgments

The authors thank the editor, associate editor and two referees for insightful comments and suggestions. The research was supported by grants from the National Science Foundation and National Institutes of Health, U.S.A.

Appendix

For a matrix M, its eigenvalues, minimum eigenvalue, maximum eigenvalue and trace are denoted by λ_j(M), λ_min(M), λ_max(M) and tr(M), respectively. Let ‖M‖ = sup_{‖x‖=1} ‖Mx‖ = {λ_max(M^TM)}^{1/2} be the matrix L_2-norm, and let ‖M‖_F = {tr(M^TM)}^{1/2} be the Frobenius norm; see Golub & Van Loan (1996) for details. Throughout the proofs, C is used as a generic finite constant.

We first impose some regularity conditions, which are not the weakest possible.

Condition A consists of the following.

  • A1. Assume sup_{n⩾1} ‖β̃_0^{(I)}‖_1 < ∞ and that ‖X‖_∞ is bounded;

  • A2. the matrix E(X̃X̃^T) exists and is nonsingular;

  • A3. assume E(Y²) < ∞;

  • A4. there is a large enough open subset of 𝕉^{p_n+1}, which contains the true parameter point β̃_0, such that F^{-1}(X̃^Tβ̃) is bounded for all β̃ in the subset;

  • A5. the eigenvalues of the matrix −E(q^{(2)}{m(X)}/[F^{(1)}{m(X)}]² X̃X̃^T) are uniformly bounded away from 0;

  • A6. the function q^{(4)}(·) is continuous and q^{(2)}(·) < 0;

  • A7. the function F(·) is a bijection, F^{(3)}(·) is continuous and F^{(1)}(·) ≠ 0; and finally

  • A8. assume P_{λ_n}(0) = 0, and that there are constants C and D such that, when θ_1, θ_2 > Cλ_n, |P″_{λ_n}(θ_1) − P″_{λ_n}(θ_2)| ⩽ D|θ_1 − θ_2|.

Condition B: These are identical to Condition A except that A3 and A5 are replaced by B3 and B5:

  • B3. there exists a constant C ∈ (0, ∞) such that E{|Y − m(X)|^j} ⩽ j!C^j for all j ⩾ 3; also, inf_{n⩾1, 1⩽j⩽p_n} E{var(Y | X)X_j²} > 0; and

  • B5. assume that λ_j(Ω_n) and λ_j(H_n) are uniformly bounded away from 0, and that ‖H_n^{-1}Ω_n‖ is bounded away from ∞.

Condition C: These are identical to Condition B except that B4 is replaced by:

  • C4. there is an open subset of 𝕉^{p_n+1} which contains the true parameter point β̃_0, such that F^{-1}(X̃^Tβ̃) is bounded for all β̃ in the subset; moreover, the subset contains the origin.

Condition D: This is identical to Condition C except that C5 is replaced by:

  • D5. assume that λ_j(H_n) are uniformly bounded away from 0 and that ‖H_n^{-1/2}Ω_n^{1/2}‖ is bounded away from ∞.

Condition E is as follows.

E1. Assume min_{j=1,…,s_n} |E(X_jY)| ≽ 𝒜_n and max_{s_n+1⩽j⩽p_n} |E(X_jY)| = o(𝒝_n) for some positive sequences 𝒜_n and 𝒝_n, where s_n ≽ t_n, for two nonnegative sequences s_n and t_n, means that there exists a constant c > 0 such that s_n ⩾ c t_n for all n ⩾ 1.

Proof of Theorem 1. Let r_n = (p_n/n)^{1/2} and ũ = (u_0, u_1, …, u_{p_n})^T ∈ 𝕉^{p_n+1}. Similarly to Fan & Peng (2004), it suffices to show that for any given ε > 0 there is a large constant C such that, for large n,

pr{inf_{‖ũ‖=C} ℓ_n(β̃_0 + r_nũ) > ℓ_n(β̃_0)} ⩾ 1 − ε.  (A1)

Define β̃_L = β̃_0 + r_nũ. To show (A1), consider

D_n(ũ) = (1/n) Σ_{i=1}^n [Q{Y_i, F^{-1}(X̃_i^Tβ̃_L)} − Q{Y_i, F^{-1}(X̃_i^Tβ̃_0)}] + Σ_{j=1}^{p_n} {P_{λ_n}(|β_{j;0} + r_nu_j|) − P_{λ_n}(|β_{j;0}|)} ≡ I_1 + I_2.  (A2)

First, we consider I_1. For μ = F^{-1}(θ), obtain q_j(y; θ) (j = 1, 2, 3) from (7). By Taylor's expansion,

I_1 = I_{1,1} + I_{1,2} + I_{1,3},  (A3)

where I_{1,1} = (r_n/n) Σ_{i=1}^n q_1(Y_i; X̃_i^Tβ̃_0) X̃_i^Tũ, I_{1,2} = {r_n²/(2n)} Σ_{i=1}^n q_2(Y_i; X̃_i^Tβ̃_0)(X̃_i^Tũ)² and I_{1,3} = {r_n³/(6n)} Σ_{i=1}^n q_3(Y_i; X̃_i^Tβ̃*)(X̃_i^Tũ)³, for β̃* located between β̃_0 and β̃_0 + r_nũ. Hence |I_{1,1}| ⩽ O_P{r_n(p_n/n)^{1/2}}‖ũ‖ and I_{1,2} = (r_n²/2) ũ^T E(−q^{(2)}{m(X)}/[F^{(1)}{m(X)}]² X̃X̃^T) ũ + O_P(r_n² p_n/n^{1/2})‖ũ‖². Conditions A1 and A4 give |I_{1,3}| ⩽ O_P(r_n³ p_n^{3/2})‖ũ‖³.

Next, we consider I_2. By Taylor's expansion, I_2 ⩾ r_n Σ_{j=1}^{s_n} P′_{λ_n}(|β_{j;0}|) sign(β_{j;0}) u_j + (r_n²/2) Σ_{j=1}^{s_n} P″_{λ_n}(|β_j*|) u_j² ≡ I_{2,1} + I_{2,2}, for β_j* between β_{j;0} and β_{j;0} + r_nu_j. Thus |I_{2,1}| ⩽ r_n a_n ‖u^{(I)}‖_1 and |I_{2,2}| ⩽ r_n² b_n ‖u^{(I)}‖² + D r_n³ ‖u^{(I)}‖³, where u^{(I)} = (u_1, …, u_{s_n})^T. Since p_n^4/n → 0, we can choose some large C such that I_{1,1}, I_{1,3}, I_{2,1} and I_{2,2} are all dominated by I_{1,2}, which is positive by Condition A5. This implies (A1).

Proof of Lemma 1. Analogously to the proof of Theorem 1, it suffices to show (A1). Note that (A2) continues to hold with I_2 = λ_n Σ_{j=1}^{p_n} w_j(|β_{j;0} + r_nu_j| − |β_{j;0}|) and I_1 unchanged. Clearly, I_2 ⩾ −λ_n r_n Σ_{j=1}^{s_n} w_j|u_j| ≡ I_{2,1}, in which |I_{2,1}| ⩽ λ_n r_n w_max^{(I)} ‖u^{(I)}‖_1. The rest of the proof resembles that of Theorem 1 and is omitted.

Proof of Theorem 5. Write ũ = {ũ^{(I)T}, u^{(II)T}}^T, where ũ^{(I)} = (u_0, u_1, …, u_{s_n})^T and u^{(II)} = (u_{s_n+1}, …, u_{p_n})^T. Following the proof of Lemma 1, it suffices to show (A1) for r_n = (s_n/n)^{1/2}.

For I_{1,1} in (A3), decompose I_{1,1} = I_{1,1}^{(I)} + I_{1,1}^{(II)} according to ũ^{(I)} and u^{(II)}. It follows that |I_{1,1}^{(I)}| ⩽ r_n O_P{(s_n/n)^{1/2}}‖ũ^{(I)}‖_2 and |I_{1,1}^{(II)}| ⩽ r_n O_P(1/n^{1/2})‖u^{(II)}‖_1.

For I_{1,2} in (A3), similarly to the proof of Theorem 1, write I_{1,2} = I_{1,2,1} + I_{1,2,2}. Define d_i = −q^{(2)}{m(X_i)}/[F^{(1)}{m(X_i)}]². This yields

I_{1,2,1} ⩾ {r_n²/(2n)} Σ_{i=1}^n d_i(X̃_i^{(I)T}ũ^{(I)})² − (r_n²/n) |Σ_{i=1}^n d_i(X̃_i^{(I)T}ũ^{(I)})(X_i^{(II)T}u^{(II)})| ≡ I_{1,2,1}^{(I)} − I_{1,2,1}^{(cross)}.

Then there exists a constant C > 0 such that I_{1,2,1}^{(I)} ⩾ C r_n²{1 + o_P(1)}‖ũ^{(I)}‖_2² and |I_{1,2,1}^{(cross)}| ⩽ O_P(r_n² s_n^{1/2})‖ũ^{(I)}‖_2‖u^{(II)}‖_1. For I_{1,2,2}, partitioning ũ into ũ^{(I)} and u^{(II)} gives

I_{1,2,2} = I_{1,2,2}^{(I)} + I_{1,2,2}^{(cross)} + I_{1,2,2}^{(II)},

where |I_{1,2,2}^{(I)}| ⩽ r_n² O_P(s_n/n^{1/2})‖ũ^{(I)}‖_2², |I_{1,2,2}^{(cross)}| ⩽ r_n² O_P{(s_n/n)^{1/2}}‖ũ^{(I)}‖_2‖u^{(II)}‖_1 and |I_{1,2,2}^{(II)}| ⩽ r_n² O_P(n^{-1/2})‖u^{(II)}‖_1².

For I_{1,3} in (A3), since s_n p_n = o(n), ‖β̃*‖_1 is bounded and thus |I_{1,3}| ⩽ O_P(r_n³)‖ũ^{(I)}‖_1³ + O_P(r_n³)‖u^{(II)}‖_1³ ≡ I_{1,3}^{(I)} + I_{1,3}^{(II)}, where I_{1,3}^{(I)} ⩽ O_P(r_n³ s_n^{3/2})‖ũ^{(I)}‖_2³ and I_{1,3}^{(II)} ⩽ O_P(r_n³)‖u^{(II)}‖_1³.

For I_2 in (A2), I_2 ⩾ −I_{2,1}^{(I)} + I_{2,1}^{(II)}, where I_{2,1}^{(I)} = λ_n r_n Σ_{j=1}^{s_n} w_j|u_j| and I_{2,1}^{(II)} = λ_n r_n Σ_{j=s_n+1}^{p_n} w_j|u_j|. Hence |I_{2,1}^{(I)}| ⩽ λ_n r_n w_max^{(I)} s_n^{1/2}‖u^{(I)}‖_2 and I_{2,1}^{(II)} ⩾ λ_n r_n w_min^{(II)}‖u^{(II)}‖_1.

It can be shown that either I_{1,2,1}^{(I)} or I_{2,1}^{(II)} dominates all other terms in the groups 𝒢_1 = (I_{1,2,2}^{(I)}, I_{1,3}^{(I)}), 𝒢_2 = (I_{1,1}^{(II)}, I_{1,2,2}^{(II)}, I_{1,3}^{(II)}, I_{1,2,1}^{(cross)}, I_{1,2,2}^{(cross)}) and 𝒢_3 = (I_{1,1}^{(I)}, I_{2,1}^{(I)}). Namely, I_{1,2,1}^{(I)} dominates 𝒢_1 and I_{2,1}^{(II)} dominates 𝒢_2. For 𝒢_3, if ‖u^{(II)}‖_1 ⩽ C/2, then 𝒢_3 is dominated by I_{1,2,1}^{(I)}, which is positive; if ‖u^{(II)}‖_1 > C/2, then 𝒢_3 is dominated by I_{2,1}^{(II)}, which is positive.

Proof of Theorem 8. Minimizing (14) is equivalent to minimizing ℓ_{n,j}^{PCR}(α) = n^{-1} Σ_{i=1}^n Q{Y_i, F^{-1}(X_{ij}α)} + κ_n|α| separately for j = 1, …, p_n. The proof is separated into two parts.

Part 1. To show that ŵ_max^{(I)} = O_P{1/(λ_n n^{1/2})}, it suffices to show that, for 𝒜_n = λ_n n^{1/2}, there exist local minimizers β̂_j^{PCR} of ℓ_{n,j}^{PCR}(α) such that lim_{δ↓0} inf_{n⩾1} pr(min_{1⩽j⩽s_n}|β̂_j^{PCR}| > 𝒜_nδ) = 1. It suffices to prove that for j = 1, …, s_n there exist some b_j with |b_j| = 2δ such that

lim_{δ↓0} inf_{n⩾1} pr[min_{1⩽j⩽s_n}{inf_{|α|⩽δ} ℓ_{n,j}^{PCR}(𝒜_nα) − ℓ_{n,j}^{PCR}(𝒜_nb_j)} > 0] = 1,  (A4)

and that there exists some large enough C_n > 0 such that

lim_{δ↓0} inf_{n⩾1} pr[min_{1⩽j⩽s_n}{inf_{|α|⩾C_n} ℓ_{n,j}^{PCR}(𝒜_nα) − ℓ_{n,j}^{PCR}(𝒜_nb_j)} > 0] = 1.  (A5)

Note that (A5) holds, since for every n ⩾ 1, as |α| → ∞, min_{1⩽j⩽s_n}{ℓ_{n,j}^{PCR}(𝒜_nα) − ℓ_{n,j}^{PCR}(𝒜_nb_j)} ⩾ κ_n𝒜_n|α| − max_{j=1,…,s_n} ℓ_{n,j}^{PCR}(𝒜_nb_j) → ∞ in probability. To prove (A4), note that |𝒜_nα| ⩽ 𝒜_nδ = O(1)δ → 0 as δ ↓ 0. By Taylor's expansion,

min_{j=1,…,s_n}{inf_{|α|⩽δ} ℓ_{n,j}^{PCR}(𝒜_nα) − ℓ_{n,j}^{PCR}(𝒜_nb_j)} ⩾ 𝒜_n min_{j=1,…,s_n} inf_{|α|⩽δ}{(α − b_j) n^{-1} Σ_{i=1}^n q_1(Y_i; 0)X_{ij}} + (𝒜_n²/2) min_{j=1,…,s_n} inf_{|α|⩽δ}{α² n^{-1} Σ_{i=1}^n q_2(Y_i; X_{ij}𝒜_nα_j*)X_{ij}² − b_j² n^{-1} Σ_{i=1}^n q_2(Y_i; X_{ij}𝒜_nb_j*)X_{ij}²} + 𝒜_n min_{1⩽j⩽s_n} inf_{|α|⩽δ}{κ_n(|α| − |b_j|)} ≡ I_1 + I_2 + I_3,

with α_j* between 0 and α, and b_j* between 0 and b_j. Let μ_0 = F^{-1}(0) and C_0 = q″(μ_0)/F′(μ_0) ≠ 0. Then

I_1 ⩾ 𝒜_n min_{j=1,…,s_n} inf_{|α|⩽δ}{C_0(α − b_j)E(YX_j)} + 𝒜_n min_{j=1,…,s_n} inf_{|α|⩽δ}[C_0(α − b_j) n^{-1} Σ_{i=1}^n {Y_iX_{ij} − E(YX_j)}] − 𝒜_n max_{1⩽j⩽s_n} sup_{|α|⩽δ}{C_0μ_0(α − b_j) n^{-1} Σ_{i=1}^n X_{ij}} ≡ I_{1,1} + I_{1,2} + I_{1,3}.

We see that |I_{1,3}| ⩽ O_P[𝒜_n{log(s_n)/n}^{1/2}]δ, by Bernstein's inequality (Lemma 2.2.11 of van der Vaart & Wellner, 1996). Similarly, |I_{1,2}| = O_P[𝒜_n{log(s_n)/n}^{1/2}]δ by an argument similar to that for Theorem 2. Choosing b_j = −2δ sign{C_0E(YX_j)}, which satisfies |b_j| = 2δ, gives I_{1,1} ⩾ |C_0|c𝒜_n²δ. For I_2 and I_3, we observe that |I_2| ⩽ O_P(𝒜_n²)δ² and |I_3| = O(𝒜_nκ_n)δ. By the assumptions, we can choose a small enough δ > 0 such that, with probability tending to 1, I_{1,2}, I_{1,3}, I_2 and I_3 are dominated by I_{1,1}, which is positive. Thus (A4) is proved.

Part 2. To verify that ŵ_min^{(II)}λ_n → ∞ in probability, it suffices to prove that, for any ε > 0, there exist local minimizers β̂_j^{PCR} of ℓ_{n,j}^{PCR}(α) such that lim_{n→∞} pr(max_{s_n+1⩽j⩽p_n}|β̂_j^{PCR}| ⩽ ελ_n) = 1. Similarly to the proof of Theorem 1, we will prove that for any ε > 0,

lim_{n→∞} pr[min_{j=s_n+1,…,p_n}{inf_{|α|=ε} ℓ_{n,j}^{PCR}(λ_nα) − ℓ_{n,j}^{PCR}(0)} > 0] = 1.  (A6)

For j = s_n + 1, …, p_n, by Taylor's expansion,

min_{j=s_n+1,…,p_n}{inf_{|α|=ε} ℓ_{n,j}^{PCR}(λ_nα) − ℓ_{n,j}^{PCR}(0)} ⩾ λ_n min_{j=s_n+1,…,p_n} inf_{|α|=ε}{α n^{-1} Σ_{i=1}^n q_1(Y_i; 0)X_{ij}} + (λ_n²/2) min_{j=s_n+1,…,p_n} inf_{|α|=ε}{α² n^{-1} Σ_{i=1}^n q_2(Y_i; X_{ij}λ_nα_j*)X_{ij}²} + λ_n inf_{|α|=ε}(κ_n|α|) ≡ I_1 + I_2 + I_3,

where α_j* is between 0 and α. Similarly to the proof in Part 1, |I_1| ⩽ O_P[λ_n{log(p_n − s_n + 1)/n}^{1/2}]ε + o(λ_n𝒝_n)ε. Note that |I_2| ⩽ O_P(λ_n²)ε² and I_3 = λ_nκ_nε. By the assumptions, with probability tending to 1, I_1 and I_2 are dominated by I_3 > 0. So (A6) is proved.

Proof of Theorem 9. We first need to show Lemma A1.

Lemma A1. Suppose that (X_o, Y_o) follows the distribution of (X, Y) and is independent of the training set 𝒯_n. If Q satisfies (3), then E[Q{Y_o, m̂(X_o)}] = E[Q{Y_o, m(X_o)}] + E[Q{m(X_o), m̂(X_o)}].

Proof. Let q be the generating function of Q. We deduce from Corollary 3, p. 223, of Chow & Teicher (1988) that E{q(Y_o) | 𝒯_n, X_o} = E{q(Y_o) | X_o} and E[Y_o q′{m̂(X_o)} | 𝒯_n, X_o] = E(Y_o | X_o) q′{m̂(X_o)} = m(X_o) q′{m̂(X_o)}.

We now show Theorem 9. Setting Q in Lemma A1 to be the misclassification loss gives

(1/2)[E{R(ϕ̂)} − R(ϕ_B)] ⩽ E[|m(X_o) − 0.5| I{m(X_o) ⩽ 0.5, m̂(X_o) > 0.5}] + E[|m(X_o) − 0.5| I{m(X_o) > 0.5, m̂(X_o) ⩽ 0.5}] ≡ I_1 + I_2.

For any ε > 0, I_1 ⩽ pr{|m̂(X_o) − m(X_o)| > ε} + ε and I_2 ⩽ ε + pr{|m̂(X_o) − m(X_o)| ⩾ ε}. The proof is completed by showing that I_1 → 0 and I_2 → 0.

Supplementary material

Supplementary material is available at Biometrika online.

References

  1. Altun Y, Smola A. Unifying divergence minimization and statistical inference via convex duality. In: Lugosi G, Simon HU, editors. Learning Theory: 19th Ann Conf Learn Theory. Berlin: Springer; 2006. pp. 139–53.
  2. Bartlett MS. Approximate confidence intervals. Biometrika. 1953;40:12–19.
  3. Bickel P, Li B. Regularization in statistics (with discussion). Test. 2006;15:271–344.
  4. Bregman LM. A relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming. USSR Comp Math Math Phys. 1967;7:620–31.
  5. Chow YS, Teicher H. Probability Theory. 2nd ed. New York: Springer; 1988.
  6. Efron B. How biased is the apparent error rate of a prediction rule? J Am Statist Assoc. 1986;81:461–70.
  7. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Ann Statist. 2004;32:407–99.
  8. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Statist Assoc. 2001;96:1348–60.
  9. Fan J, Peng H. Nonconcave penalized likelihood with a diverging number of parameters. Ann Statist. 2004;32:928–61.
  10. Golub GH, Van Loan CF. Matrix Computations. 3rd ed. Baltimore, MD: Johns Hopkins University Press; 1996.
  11. Grünwald PD, Dawid AP. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. Ann Statist. 2004;32:1367–433.
  12. Güvenir HA, Acar B, Demiröz G, Çekin A. A supervised machine learning algorithm for arrhythmia analysis. Comp Cardiol. 1997;24:433–6.
  13. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. New York: Springer; 2001.
  14. Huang J, Ma SG, Zhang CH. Adaptive lasso for sparse high-dimensional regression models. Statist Sinica. 2008;18:1603–18.
  15. Kivinen J, Warmuth MK. Boosting as entropy projection. In: Proc 12th Ann Conf Comp Learn Theory. New York: ACM Press; 1999. pp. 134–44.
  16. Knight K, Fu WJ. Asymptotics for lasso-type estimators. Ann Statist. 2000;28:1356–78.
  17. Lafferty JD, Della Pietra S, Della Pietra V. Statistical learning algorithms based on Bregman distances. In: Proc 5th Can Workshop Info Theory; 1997.
  18. Lafferty J. Additive models, boosting, and inference for generalized divergences. In: Proc 12th Ann Conf Comp Learn Theory. New York: ACM Press; 1999. pp. 125–33.
  19. McCullagh P. Quasi-likelihood functions. Ann Statist. 1983;11:59–67.
  20. Meinshausen N, Buhlmann P. High dimensional graphs and variable selection with the lasso. Ann Statist. 2006;34:1436–62.
  21. Rosset S, Zhu J. Piecewise linear regularized solution paths. Ann Statist. 2007;35:1012–30.
  22. Shen X, Tseng GC, Zhang X, Wong WH. On ψ-learning. J Am Statist Assoc. 2003;98:724–34.
  23. Strimmer K. Modeling gene expression measurement error: a quasi-likelihood approach. BMC Bioinformatics. 2003;4:10. doi: 10.1186/1471-2105-4-10.
  24. Tibshirani R. Regression shrinkage and selection via the lasso. J R Statist Soc B. 1996;58:267–88.
  25. Vapnik V. The Nature of Statistical Learning Theory. New York: Springer; 1996.
  26. van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes: With Applications to Statistics. New York: Springer; 1996.
  27. Wedderburn RWM. Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika. 1974;61:439–47.
  28. Zhang CM, Jiang Y, Shang Z. New aspects of Bregman divergence in regression and classification with parametric and nonparametric estimation. Can J Statist. 2009;37:119–39.
  29. Zhang CM, Zhang ZJ. Regularized estimation of hemodynamic response function for fMRI data. Statist Interface. 2010;3:15–32.
  30. Zhao P, Yu B. On model selection consistency of lasso. J Mach Learn Res. 2006;7:2541–67.
  31. Zou H. The adaptive lasso and its oracle properties. J Am Statist Assoc. 2006;101:1418–29.


