Abstract
The use of image covariates to build classification models has a broad impact in fields such as computer science and medicine. The aim of this paper is to develop an estimation method for the logistic regression model with image covariates. We propose a novel regularized estimation approach, in which the regularization combines an L1 penalty with a Sobolev norm penalty. The L1 penalty performs variable selection, while the Sobolev norm penalty captures the shape and edge information of the image data. We develop an efficient algorithm for the resulting optimization problem and establish a nonasymptotic error bound for the parameter estimate. Simulation studies and a real data application demonstrate that the proposed method performs well.
Introduction
As one of the most important problems in machine learning, classification plays a prominent role throughout various disciplines. A large number of classification methods have been developed, such as k-nearest neighbors (KNN), linear (quadratic) discriminant analysis, logistic regression, naive Bayes, decision trees, support vector machines (SVM), neural networks, deep learning, and many others [1, 2]. Among these methods, logistic regression has a long history [3] and remains one of the most popular approaches. The logistic regression model is a typical representative of generalized linear models and of linear classification methods, and it is therefore the focus of this article. Traditionally, the maximum likelihood method is used to estimate the parameter of the logistic regression model [4–6].
However, the big data era brings massive and complex data, one of whose most prominent characteristics is high dimensionality. In high dimensional settings, the maximum likelihood estimator in the logistic regression model faces serious problems such as non-existence and non-uniqueness [7]. Regularization is a popular strategy for handling high dimensional problems [8], and many regularized methods have been proposed over the past decades, including the LASSO [9], the smoothly clipped absolute deviation (SCAD) penalty [10], the minimax concave penalty (MCP) [11], and so on. For high dimensional logistic regression, [12] considered an L1-regularization path algorithm, [13] proposed an interior-point method for large-scale L1-regularized logistic regression, [14] proposed the group lasso for logistic regression, and [15] considered L1/2 regularized logistic regression for gene selection in cancer classification.
Image data are generated in many fields, such as computer science and medicine. In addition to being high dimensional, image data usually contain spatially smooth regions with relatively sharp edges, which gives rise to characteristics such as local smoothness [16] and jump discontinuities [17]. Local smoothness leads to highly correlated features, which makes image classification more challenging [18], while jump discontinuities make conventional smoothing techniques inefficient [17]. On the other hand, exploiting these characteristics in the modeling process often improves model efficiency, and this has received a lot of attention recently. For example, [19] introduced a locally adaptive smoothing method for image restoration. [16] proposed the propagation-separation (PS) approach for local likelihood estimation, which can handle the local smoothness of image data. [20] developed an adaptive regression model for the analysis of neuroimaging data, which generalizes the PS approach. [21] studied the theoretical performance of nonlocal means for noise removal in image data. [17] considered a spatially varying coefficient model for neuroimaging data with jump discontinuities. [18] proposed a spatially weighted principal component analysis (SWPCA) for imaging classification. [22] developed generalized scalar-on-image regression models via total variation regularization, which preserve the piecewise smooth nature of imaging data. [23] proposed an efficient nuclear norm penalized estimation method for matrix linear discriminant analysis.
In this paper, we consider a logistic regression model with image covariates and develop a regularized estimation approach that combines L1 regularization with Sobolev norm regularization. The L1 penalty performs variable selection and removes covariates unrelated to the response from the model [9]. The Sobolev norm penalty preserves characteristics of image data (e.g. local smoothness) in model fitting; indeed, Sobolev regularization is a popular technique in image data analysis, for example in image denoising [24] and edge detection [25]. The proposed regularization differs from the aforementioned regularized logistic regression models. It also differs from the elastic net [26], which combines the lasso and ridge penalties. The elastic net encourages a grouping effect, whereby strongly correlated predictors tend to enter or leave the model together, but it cannot exploit the structural information of image covariates and is therefore not well suited to models with image covariates. Our proposal also differs from the fused lasso [27]. In many real data analyses, such as gene expression data, the covariates have a natural order, and adjacent covariates are often highly correlated with similar effects on the response; the fused lasso tends to make adjacent covariates share a common effect. The proposed method can be viewed as an extension of the fused lasso from one dimension to multiple dimensions, with the fusion term replaced by a Sobolev norm penalty. Furthermore, we develop an algorithm to solve the resulting optimization problem, study the theoretical properties of the estimator, and give a nonasymptotic estimation error bound. Numerical studies, including simulations and a real data analysis, verify the performance of the method.
The rest of the article is organized as follows. Section 2 presents the methodology, including the model setup, the algorithm, and theoretical properties. Section 3 presents numerical studies, including simulations and a real data application. Section 4 gives a short conclusion. The proofs of the theoretical results are given in the Appendix.
Methodology
Model setup
Suppose that we have observations (Xi, Yi) with 1 ≤ i ≤ n, where Yi ∈ {−1, +1} is the class label and Xi ∈ ℝ^{p×q} is the corresponding image covariate. We further assume that the (Xi, Yi), 1 ≤ i ≤ n, are independent and identically distributed. In order to predict Yi from Xi, the following logistic regression model is assumed:
$$\log\frac{P_i}{1 - P_i} = \langle X_i, B\rangle, \qquad (1)$$
where Pi = P(Yi = +1|Xi), B ∈ ℝ^{p×q} is the corresponding coefficient image, and ⟨⋅,⋅⟩ denotes the inner product of two matrices, i.e., the sum of their elementwise products. Let β = vec(B) = (β1,⋯,βpq)^T and, with a slight abuse of notation, Xi = vec(Xi) = (xij, j = 1,⋯,pq)^T for i = 1, ⋯, n; then ⟨Xi, B⟩ = β^T Xi, and model (1) is equivalent to log(Pi/(1 − Pi)) = β^T Xi. The true value of β is denoted by β*.
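As a quick numerical check of the identity ⟨Xi, B⟩ = β^T Xi, the following minimal Python sketch (not part of the original derivation) compares the matrix inner product with the dot product of the vectorizations. Column-major vectorization is assumed here; any fixed ordering works as long as it is applied to both matrices.

```python
import numpy as np

p, q = 4, 3
X = np.random.randn(p, q)   # image covariate X_i
B = np.random.randn(p, q)   # coefficient image B

# <X, B>: elementwise (Frobenius) inner product of the two matrices
inner = np.sum(X * B)

# vectorized form beta^T x, using column-major (Fortran-order) vectorization
beta = B.flatten(order="F")
x = X.flatten(order="F")
assert np.isclose(inner, beta @ x)
```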
Traditionally, the maximum likelihood method is used to estimate the coefficient image B. The likelihood function is
$$L(\beta) = \prod_{i=1}^{n} \frac{1}{1 + \exp(-Y_i \beta^T X_i)},$$
and the corresponding log-likelihood function is ℓ(β) = −∑_{i=1}^n ln{1 + exp(−Yi β^T Xi)}. Denote the logistic loss function by φ(u) = ln(1 + e^{−u}); the associated risk is R(β) = E[φ(Y β^T X)], and we assume that β* minimizes R(β). The empirical risk is
$$R_n(\beta) = \frac{1}{n}\sum_{i=1}^{n}\varphi(Y_i \beta^T X_i) = \frac{1}{n}\sum_{i=1}^{n}\ln\{1 + \exp(-Y_i \beta^T X_i)\}.$$
Hence, maximizing the likelihood function is equivalent to minimizing the empirical risk R_n(β).
Many optimization methods, such as the Newton-Raphson method [6], can be used to solve this problem.
However, in the image covariate case, the coefficient image B is usually assumed to be a piecewise smooth image with unknown edges. This assumption is widely used in the imaging literature and is critical for addressing various scientific questions [22]. The maximum likelihood method does not take advantage of these characteristics. Moreover, image covariates are usually high dimensional, and not every element of Xi is useful for predicting Yi, yet the maximum likelihood method cannot perform variable selection. Consequently, in the next subsection we propose a novel estimation method for B that preserves characteristics of the image covariate, such as local smoothness, and performs variable selection simultaneously.
Estimation
For the coefficient image B = (b_{jk}), we define its discrete gradient ∇B by
$$(\nabla B)_{jk} = \big(b_{j+1,k} - b_{j,k},\; b_{j,k+1} - b_{j,k}\big),$$
the discrete gradient at position (j, k): b_{j+1,k} − b_{j,k} is the difference in the vertical direction and b_{j,k+1} − b_{j,k} is the difference in the horizontal direction. The Sobolev norm of B is the L2 norm of ∇B, which can be written as
$$\|\nabla B\|_2 = \Big\{\sum_{j,k}\big[(b_{j+1,k} - b_{j,k})^2 + (b_{j,k+1} - b_{j,k})^2\big]\Big\}^{1/2},$$
where the sum runs over all positions at which the corresponding differences are defined.
In fact, we can rewrite ‖∇B‖_2^2 as a quadratic form in β. Specifically, we define a matrix D = (d_{ij}), whose rows correspond to the first-order differences above, with d_{ij} defined in the following formula (2):
| (2) |
Then one can easily verify that ‖∇B‖_2^2 = β^T D^T D β. For illustration, the matrix D in the case p = q = 3 is shown in Fig 1.
Fig 1. The matrix D in the case p = q = 3.
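To illustrate how D acts, the following sketch builds a difference matrix for a p × q image and verifies numerically that β^T D^T Dβ equals the squared L2 norm of the discrete gradient. The row ordering of D and the column-major vectorization are illustrative choices and may differ from the exact indexing in formula (2).

```python
import numpy as np

def gradient_matrix(p, q):
    """One row of D per finite difference; column-major (Fortran-order)
    vectorization of B is assumed, i.e. beta = B.flatten(order="F")."""
    rows = []
    idx = lambda j, k: j + k * p              # position of b[j, k] in vec(B)
    for k in range(q):                        # vertical differences b[j+1,k] - b[j,k]
        for j in range(p - 1):
            r = np.zeros(p * q)
            r[idx(j + 1, k)], r[idx(j, k)] = 1.0, -1.0
            rows.append(r)
    for k in range(q - 1):                    # horizontal differences b[j,k+1] - b[j,k]
        for j in range(p):
            r = np.zeros(p * q)
            r[idx(j, k + 1)], r[idx(j, k)] = 1.0, -1.0
            rows.append(r)
    return np.vstack(rows)

# Sanity check: beta^T D^T D beta equals the squared L2 norm of the discrete gradient.
p, q = 3, 3
B = np.random.randn(p, q)
beta = B.flatten(order="F")
D = gradient_matrix(p, q)
sobolev_sq = beta @ D.T @ D @ beta
direct = np.sum(np.diff(B, axis=0) ** 2) + np.sum(np.diff(B, axis=1) ** 2)
assert np.isclose(sobolev_sq, direct)
```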
We then consider the following optimization problem
$$\min_{\beta}\; Q(\beta) := R_n(\beta) + \lambda_1\, \beta^T D^T D\, \beta + \lambda_2 \|\beta\|_1, \qquad (3)$$
where λ1 and λ2 are nonnegative tuning parameters and ‖⋅‖1 is the L1 norm. The term λ1 β^T D^T Dβ shrinks adjacent elements of B towards each other and hence captures the local smoothness of B. The term λ2‖β‖1 shrinks elements of B towards 0 and performs variable selection. We next propose an algorithm to solve the optimization problem (3).
Algorithm
For the optimization problem (3), define K = D^T D = (k_{jl}) and H(β) = R_n(β) + λ1 β^T Kβ, so that Q(β) = H(β) + λ2‖β‖1. Thus Q(β) is a convex function with a separable structure [28], and [29] shows that the coordinate descent algorithm is guaranteed to converge to the global minimizer for any convex objective with this separable structure. Hence we propose a coordinate descent algorithm to solve the optimization problem (3).
For j = 1, ⋯, pq, we successively minimize Q(β) along the βj direction with the other parameters fixed. Specifically, denote the current value of β by β^c, and let pi = exp(Xi^T β^c)/{1 + exp(Xi^T β^c)} for i = 1, ⋯, n be the current fitted probabilities. For each j, we approximate H(β), viewed as a function of βj with the other coordinates fixed at their current values, by a second order Taylor expansion. This yields a quadratic approximation of the form
$$\frac{a_j}{2}\,\beta_j^2 + c_j\,\beta_j + C,$$
where aj and cj collect the second and first order coefficients of the expansion (including the contribution of the penalty term λ1 β^T Kβ), and C is a constant containing no information about βj. One can then update βj by minimizing
$$\frac{a_j}{2}\,\beta_j^2 + c_j\,\beta_j + \lambda_2 |\beta_j|.$$
Setting the subderivative with respect to βj equal to zero, where the subderivative of |βj| is sign(βj) if βj ≠ 0 and any value in [−1, 1] otherwise, we obtain the soft-thresholding update
$$\beta_j \leftarrow \operatorname{sign}(-c_j)\,\frac{(|c_j| - \lambda_2)_+}{a_j},$$
where (x)_+ = max{x, 0}.
We summarize the algorithm as follows.
Coordinate descent algorithm

Step 1. Initialization: choose an initial value of β.

Step 2. For t = 1, 2, ⋯, update β:

For j = 1, ⋯, pq:
- Compute pi for 1 ≤ i ≤ n;
- Compute aj and cj;
- Update βj ← sign(−cj)(|cj| − λ2)_+/aj;

End for.

Step 3. Repeat Step 2 until convergence.
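The update above can be made concrete with the following Python sketch, which uses a Newton-type quadratic approximation of the logistic loss along each coordinate followed by soft-thresholding. The function names and the exact per-coordinate quantities (here the gradient and curvature of the smooth part of Q) are illustrative and may differ from the authors' implementation.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lg_sob_coordinate_descent(X, y, K, lam1, lam2, n_iter=100, tol=1e-6):
    """Coordinate descent for
       Q(beta) = (1/n) sum_i log(1 + exp(-y_i x_i^T beta))
                 + lam1 * beta^T K beta + lam2 * ||beta||_1,
    with X an (n, d) matrix of vectorized images, y in {-1, +1}, K = D^T D."""
    n, d = X.shape
    beta = np.zeros(d)
    eta = X @ beta                                # current linear predictors x_i^T beta
    for _ in range(n_iter):
        beta_old = beta.copy()
        for j in range(d):
            prob = 1.0 / (1.0 + np.exp(-eta))     # current fitted P(Y = +1 | x)
            w = prob * (1.0 - prob)               # logistic curvature weights
            # gradient and curvature of the smooth part of Q along beta_j
            grad = np.mean(X[:, j] * (prob - (y + 1) / 2)) + 2.0 * lam1 * (K[j] @ beta)
            curv = np.mean(w * X[:, j] ** 2) + 2.0 * lam1 * K[j, j] + 1e-12
            z = beta[j] - grad / curv
            b_new = soft_threshold(z, lam2 / curv)
            eta += X[:, j] * (b_new - beta[j])    # keep the predictors up to date
            beta[j] = b_new
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta
```

In practice one would vectorize the inner loop or refresh the weights only once per sweep, as in standard penalized GLM solvers.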
By the proposed algorithm, we obtain the solution of (3), which is denoted by β̂. The theoretical properties of β̂ as an estimator of β are studied in the next subsection.
Theoretical properties
In this subsection, we study the properties of β̂ and give a nonasymptotic error bound for it. We assume that the true value β* is sparse. Let I* = {j : β*_j ≠ 0}, and let k* = |I*| be the cardinality of I*. For the purpose of the theoretical studies, we make the following assumptions.
Assumption 1. Assume that there exists a constant L such that |xij| ≤ L for every 1 ≤ i ≤ n, 1 ≤ j ≤ pq.
Assumption 2. Assume that there exists a constant C such that ‖β*‖1 ≤ C.
Assumption 3. For the matrix K, assume that there exists a constant C0 such that λmax(K) ≤ C0, where λmax(K) is the largest eigenvalue of K.
Assumption 4. Let . Define the set Assume that there exists a constant 0 < b ≤ 1 such that for every β ∈ Vα, ϵ,
Assumption 1 imposes a common bound L on all xij with i = 1, ⋯, n, j = 1, ⋯, pq. Assumption 2 bounds ‖β*‖1. Together, Assumptions 1 and 2 ensure that the Pi, 1 ≤ i ≤ n, are bounded away from zero and one. If some Pi equals zero or one, the i-th subject is either ignorable or dominant in the likelihood function, which is undesirable in statistical analysis; this case is ruled out by Assumptions 1 and 2. Assumption 3 requires that the largest eigenvalue of K be bounded. Assumption 4 is the so-called Condition Stabil, which can be regarded as a stability requirement on the correlation structure [30]. Under these assumptions, we have the following theorem.
Theorem 1. Suppose that Assumptions 1–3 hold and that Assumption 4 holds with d = max{pq, n}, and let λ1 = λ2/(6CC0). If λ2 satisfies the condition

| (4) |

then, with high probability,
$$\|\hat{\beta} - \beta^*\|_1 \le \frac{3k^*\lambda_2}{sb} + \Big(1 + \frac{3s}{\lambda_2}\Big)\epsilon,$$
where s = (1 + e^A)^{−4} with A = 8CL is a constant.
The proof of Theorem 1 is given in the Appendix. The theorem shows that, with high probability, the L1 norm of the estimation error is bounded by 3k*λ2/(sb) + (1 + 3s/λ2)ϵ. The term (1 + 3s/λ2)ϵ is of order O(d/2^d) and is therefore negligible for large d. Hence the term 3k*λ2/(sb) dominates the upper bound, and it becomes larger as b becomes smaller. If we further assume that ln(pq) = o(n), condition (4) allows λ2 to tend to 0, so that 3k*λ2/(sb) → 0 and the upper bound tends to zero. Consequently, the consistency of β̂ is guaranteed.
The selection of tuning parameters
The objective function (3) contains two tuning parameters, λ1 and λ2, which should be determined by some criterion such as BIC or cross validation. In our simulation studies, we select the tuning parameters using a validation set, and in the real data analysis the cross validation method is used. Before applying these methods, one should first determine the range of the tuning parameters. Specifically, we reparameterize λ1 and λ2: let λ = λ1 + λ2 and α = λ2/λ. Then the penalty terms in (3) can be rewritten as λ(α‖β‖1 + (1 − α)β^T Kβ). Because α ∈ [0, 1], the candidate values of α are set to 0.02κ for κ = 1, ⋯, 50. For a given α, we denote by λ0 the threshold value such that whenever λ ≥ λ0, the solution of (3) is exactly zero. By the subgradient (Karush-Kuhn-Tucker) condition of (3) at β = 0, if β = 0 is the solution, then every element of ∇R_n(0)/(λα) belongs to [−1, 1]. This means that λ0 = ‖∇R_n(0)‖_∞/α. Following the idea of [14], the candidate values of λ are set to 0.001 and 0.96^ν λ0 for ν = 0, 1, ⋯, 160. For the validation set method, the prediction error on the validation set obtained with tuning parameters (α, λ) is denoted by PE(α, λ), and the final (α, λ) are selected as the minimizer of PE(α, λ).
For the M-fold cross validation method, the data are randomly divided into M folds of approximately equal size. For m = 1, ⋯, M, we treat the m-th fold as the validation set and fit the model with tuning parameters (α, λ) on the remaining M − 1 folds. The corresponding prediction error on the held-out fold is denoted by PE_m(α, λ), and the cross validation prediction error is defined as
$$CV(\alpha, \lambda) = \frac{1}{M}\sum_{m=1}^{M} PE_m(\alpha, \lambda).$$
The (α, λ) are selected as the minimizer of the cross validation prediction error [6].
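The following sketch illustrates the validation-set search over the (α, λ) grid described above. It reuses the lg_sob_coordinate_descent sketch given earlier; the helper name select_tuning and the use of the misclassification rate as PE(α, λ) are assumptions made for illustration.

```python
import numpy as np

def select_tuning(X, y, K, X_val, y_val, fit):
    """Grid search over (alpha, lambda); `fit` is a solver such as the
    lg_sob_coordinate_descent sketch, called with lam1 and lam2."""
    n = X.shape[0]
    # lambda_0 for a given alpha: the smallest lambda making beta = 0 optimal,
    # i.e. the sup-norm of the gradient of R_n at 0 divided by alpha.
    grad0 = np.max(np.abs(X.T @ ((y + 1) / 2 - 0.5))) / n
    best, best_err = None, np.inf
    for alpha in (0.02 * k for k in range(1, 51)):
        lam0 = grad0 / alpha
        for lam in [0.001] + [0.96 ** v * lam0 for v in range(161)]:
            beta = fit(X, y, K, lam1=lam * (1 - alpha), lam2=lam * alpha)
            pred = np.where(X_val @ beta > 0, 1, -1)
            err = np.mean(pred != y_val)          # PE(alpha, lambda) on the validation set
            if err < best_err:
                best, best_err = (alpha, lam), err
    return best
```

In practice one would warm-start the solver along the decreasing λ path rather than refitting from scratch at every grid point.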
Numerical studies
In this section, we evaluate the performance of the proposed method through two simulated examples and a real data analysis. For comparison, we also consider the logistic regression model with the L1 penalty [12, 13], the logistic regression model with the fused lasso penalty, and the linear support vector machine, denoted by LG-L1, LG-fused, and Linear SVM, respectively. Our proposed method is abbreviated as LG-sob.
Simulation studies
Example 1. We generate data from the following model:
$$\log\frac{P(Y_i = +1 \mid X_i)}{1 - P(Y_i = +1 \mid X_i)} = \langle X_i, B_0\rangle,$$
where Xi and B0 both belong to ℝ^{32×32}. One consequence of using image covariates is that the corresponding regression coefficient can also be treated as an image. Hence we specify B0 directly as an image, while Xi is generated from a multivariate normal distribution. Specifically, we define the vectorization of Xi as Xi, and Xi is generated from a multivariate normal distribution with mean 0 and a prescribed covariance between the j1-th and j2-th entries for any 1 ≤ j1, j2 ≤ 1024. The parameter image B0 is considered in two cases, shown in Fig 2. The first case, denoted by B01, is a bird picture in which the blue region takes the value 0 and the yellow region takes the value 1. The second case, denoted by B02, is a butterfly picture, which is more complicated and takes values in the interval [−0.0197, 0.0628]. Given Xi and B0, the response Yi is generated from a two-point distribution with P(Yi = +1|Xi) given by the model above.
Fig 2. Simulated example.
The true parameter images B0.
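To make the data generating mechanism concrete, the following sketch simulates one training set in the spirit of Example 1. The exponentially decaying covariance and the simple rectangular coefficient image are illustrative assumptions standing in for the covariance structure and the bird/butterfly images used in the paper, which are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
p = q = 32
d = p * q
n = 500

# Illustrative coefficient image: a rectangular "blob" standing in for B01.
B0 = np.zeros((p, q))
B0[10:22, 12:24] = 1.0
beta0 = B0.flatten(order="F")

# Illustrative covariance: exponential decay in the pixel-index distance (assumption).
idx = np.arange(d)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)   # rows are vectorized images

# Responses from the logistic model: P(Y = +1 | X) = 1 / (1 + exp(-<X, B0>)).
prob = 1.0 / (1.0 + np.exp(-X @ beta0))
Y = np.where(rng.uniform(size=n) < prob, 1, -1)
```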
Example 2. The data generating mechanism is similar to that of Example 1; the only difference is that Xi is generated in a more complex way. In particular, we follow the simulation scheme of [22] and generate Xi from a 32 × 32 phantom map with 1024 pixels according to the spatially correlated random process Xi = ∑_l l^{−1} η_{il} φ_l, where the η_{il} are standard normal random variables and the φ_l are bivariate Haar wavelet basis functions.
For these two simulated examples, in addition to the training set of sample size n, we generate a validation set and a test set, each with sample size 500. We train the model on the training set, select the tuning parameters on the validation set, and compute the classification accuracy on the test set to evaluate the performance of the model.
For every specification of the parameter B0 and sample size n, we replicate the simulation 100 times for each example; the average prediction errors are summarized in Table 1 for Example 1 and Table 2 for Example 2. Besides the prediction errors, we also report the average estimation errors of the parameter image estimates over the 100 replications for LG-sob, LG-L1, and LG-fused. From the results, one can see that our proposed method outperforms the other three methods in all cases in terms of prediction. As the sample size n becomes larger, the prediction errors become smaller, but the estimation errors do not decrease consistently. The reason may be that the tuning parameters are selected by minimizing the prediction error.
Table 1. Results of simulated example 1: Prediction error (PE) and estimation error (EE).
| Method | PE (500, B01) | EE (500, B01) | PE (1000, B01) | EE (1000, B01) | PE (500, B02) | EE (500, B02) | PE (1000, B02) | EE (1000, B02) |
|---|---|---|---|---|---|---|---|---|
| LG-sob | 0.099 | 337.645 | 0.075 | 336.173 | 0.107 | 8.153 | 0.080 | 13.505 |
| LG-L1 | 0.272 | 404.589 | 0.199 | 375.926 | 0.272 | 17.354 | 0.204 | 23.143 |
| LG-fused | 0.248 | 423.242 | 0.190 | 406.866 | 0.248 | 7.016 | 0.190 | 9.400 |
| Linear SVM | 0.221 | NA | 0.174 | NA | 0.223 | NA | 0.172 | NA |
Table 2. Results of simulated example 2: Prediction error (PE) and estimation error (EE).
| Method | PE (500, B01) | EE (500, B01) | PE (1000, B01) | EE (1000, B01) | PE (500, B02) | EE (500, B02) | PE (1000, B02) | EE (1000, B02) |
|---|---|---|---|---|---|---|---|---|
| LG-sob | 0.028 | 181.06 | 0.023 | 176.19 | 0.049 | 582.94 | 0.038 | 1096.1 |
| LG-L1 | 0.073 | 1712.6 | 0.058 | 1501.7 | 0.116 | 2190.9 | 0.090 | 3022.4 |
| LG-fused | 0.052 | 438.57 | 0.044 | 495.73 | 0.097 | 482.19 | 0.076 | 775.99 |
| Linear SVM | 0.050 | NA | 0.038 | NA | 0.084 | NA | 0.071 | NA |
Moreover, we randomly select one simulated result from the 100 replications of Example 1 and show the parameter image estimates in Fig 3. One can see that our proposed LG-sob method captures the shapes of the images, whereas LG-L1 and LG-fused do not.
Fig 3. Simulated example.
One randomly selected set of parameter image estimates. The first row shows the results of our proposed LG-sob, the second row the results of LG-L1, and the third row the results of LG-fused.
A real data analysis
Classification on the ZIP Code Dataset is a benchmark problem in the machine learning community [6]. The dataset can be obtained from the following website: https://web.stanford.edu/~hastie/StatLearnSparsity_files/DATA/zipcode.html [28]. It contains normalized handwritten digits, automatically scanned from envelopes by the U.S. Postal Service. Each observation is a handwritten digit, represented as a size-normalized 16 × 16 grayscale image [31], and the task is to use the 256 pixel values to predict the corresponding digit. The dataset consists of a training set with 7291 observations and a test set with 2007 observations. Because this article only considers binary response prediction with logistic regression models, and the digits 3 and 8 share similar characteristics, we only consider handwritten 3’s and 8’s. The numbers of handwritten 3’s and 8’s are 658 and 542, respectively, in the training set, and both are 166 in the test set. Fig 4 shows some examples of handwritten 3’s and 8’s.
Fig 4. Real data analysis.
Some examples of handwritten 3’s and 8’s.
More specifically, we denote the i-th observation by Xi ∈ ℝ^{16×16} and define the corresponding class label as Yi = −1 if Xi represents a handwritten 3 and Yi = +1 if Xi represents a handwritten 8. Our proposed method is applied to construct a classifier that predicts Yi (i.e., the handwritten numeral) from the grayscale image Xi. We train the model on the training set and evaluate the performance on the test set by classification accuracy. For comparison, we also apply LG-L1, LG-fused, and Linear SVM.
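A minimal preprocessing sketch for this experiment is given below. The file names and layout (whitespace-separated files zip.train and zip.test, with the digit label in the first column and the 256 pixel values in the remaining columns) are assumptions about the downloaded data and should be checked against the actual files.

```python
import numpy as np

def load_threes_and_eights(path):
    # Assumed layout: first column = digit label, remaining 256 columns = pixel values.
    data = np.loadtxt(path)
    labels, pixels = data[:, 0].astype(int), data[:, 1:]
    keep = np.isin(labels, (3, 8))
    X = pixels[keep]                        # each row is a vectorized 16 x 16 image
    y = np.where(labels[keep] == 8, 1, -1)  # 8 -> +1, 3 -> -1, as in the paper
    return X, y

X_train, y_train = load_threes_and_eights("zip.train")
X_test, y_test = load_threes_and_eights("zip.test")
```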
The tuning parameters are selected by the 10-fold cross validation (CV) method. The CV prediction errors for the various parameter settings are plotted in Fig 5. Our proposed method selects the tuning parameters α = 0.04 and λ = 0.0118, while the method with the L1 penalty selects λ = 0.0014. The parameter image estimates are shown in Fig 6. One can see that our proposed method tends to make adjacent pixels have similar effects in the model, LG-L1 yields a sparser parameter estimate, and LG-fused tends to make only vertically adjacent pixels share similar effects. The top-left region of the parameter image has a positive effect towards handwritten 8, and the bottom-right region has a positive effect towards handwritten 3. The classification accuracy on the test set is 96.99% for our proposed method, while the accuracies of LG-L1, LG-fused, and Linear SVM are 96.39%, 96.08%, and 96.39%, respectively. The proposed method performs best.
Fig 5. Real data analysis.
The results of 10-fold CV: prediction error for various parameter settings.
Fig 6. Real data analysis.
The parameter image estimates. Left: our proposed LG-sob; Middle: LG-L1; Right: LG-fused.
Conclusion
We have developed a novel estimation method for logistic regression with image covariates. The method not only performs variable selection but also captures the shape features of the parameter image. Both theoretical results and numerical studies show that the method performs well. We have proposed a coordinate descent algorithm to solve the optimization problem, and the global convergence of the algorithm is guaranteed. However, as pointed out by one referee, the coordinate descent algorithm is time consuming, especially for high dimensional image covariates. More efficient optimization approaches, such as Nesterov accelerated gradient methods [32] and interior-point methods [13], may be more suitable; we will investigate this issue in future work. Furthermore, our method is based on Sobolev norm regularization, whereas total variation regularization is better suited to capturing sharp edges and jumps in the parameter image; however, the corresponding estimation algorithm is more complex, and we leave it for future work.
Appendix: Proof of Theorem 1
Before giving the proof of Theorem 1, we first list the bounded differences inequality as the following lemma without proof.
Lemma 1 (the bounded differences inequality) Suppose that Z1, ⋯, Zn are independent random variables and that the function f satisfies the bounded differences assumption
$$\sup_{z_1,\dots,z_n,\,z_i'} \big|f(z_1,\dots,z_i,\dots,z_n) - f(z_1,\dots,z_i',\dots,z_n)\big| \le c_i$$
for i = 1, ⋯, n. Then for all t > 0,
$$P\big(f(Z_1,\dots,Z_n) - E f(Z_1,\dots,Z_n) \ge t\big) \le \exp\Big(-\frac{2t^2}{\sum_{i=1}^{n} c_i^2}\Big).$$
For more details about Lemma 1 and its proof, one can refer to [33]. The following is the proof of Theorem 1.
Proof of Theorem 1. By the definitions of and β*, one can see that
and
Hence, we have that
| (5) |
Moreover,
| (6) |
We first consider the term . Specifically, define . Let , and
which is the empirical measure corresponding to replacing (Yl, Xl) by . Then
where the inequality is obtained by a first order Taylor expansion and Assumption 1. Then by Lemma 1, we can obtain that
Let , then we have that , and P(Ln − E(Ln) ≥ u) ≤ δ.
Let d = max{pq, n}. Taking , by Lemma 3 of [34] with Cφ = 1, CF = L, we have
Consequently, we have that
By the condition of Theorem 1, we know , hence P(Ln ≤ λ2/3) ≥ 1 − δ.
On the event {Ln ≤ λ2/3}, we have that
| (7) |
Secondly, we consider the term . Based on Assumptions 2 and 3, one can see that
One can see that if λ1 = λ2/(6CC0), we have
| (8) |
Consequently, on the event {Ln ≤ λ2/3} we combine (6), (7), (8), and obtain
| (9) |
Hence we have that
| (10) |
By (9) one can also obtain that
Consequently, we have . This means that .
By Example 4.5 in [35], we have that , where P(β) = 1/(1 + e^{−X^T β}) and EX(⋅) is the expectation with respect to the distribution of X. Using a Taylor expansion, one can obtain that
where for some τ ∈ (0, 1). Moreover, by (10) and Assumptions 1-2, we have This means that
where s = (1 + eA)−4 and A = 8CL, then by Assumption 4 we have that
| (11) |
Furthermore, we have
where the first inequality follows by (11), the second inequality follows by (5), and the fourth inequality is obtained by combining the results of (7) and (8). Consequently, we have
where a is a positive constant and the second inequality follows by
Let a = λ2/(3sb), then
This completes the proof of the Theorem.
Acknowledgments
We thank the Editor, the Associate Editor, and four referees for their helpful comments and valuable suggestions, which have greatly improved the article.
Data Availability
The ZIP Code Dataset is available from the following website: https://web.stanford.edu/~hastie/StatLearnSparsity_files/DATA/zipcode.html.
Funding Statement
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Bishop CM. Pattern Recognition and Machine Learning. Springer; 2007.
- 2. Goodfellow I, Bengio Y, Courville A. Deep Learning. The MIT Press; 2016.
- 3. Cox DR. The Regression Analysis of Binary Sequences. Journal of the Royal Statistical Society: Series B (Methodological). 1958;20(2):215–242. doi:10.1111/j.2517-6161.1958.tb00292.x
- 4. Albert A, Anderson JA. On the Existence of Maximum Likelihood Estimates in Logistic Regression Models. Biometrika. 1984;71(1):1–10. doi:10.1093/biomet/71.1.1
- 5. Santner TJ, Duffy DE. A Note on A. Albert and J. A. Anderson’s Conditions for the Existence of Maximum Likelihood Estimates in Logistic Regression Models. Biometrika. 1986;73(3):755–758. doi:10.1093/biomet/73.3.755
- 6. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. Springer; 2009. Available from: http://www-stat.stanford.edu/~tibs/ElemStatLearn/.
- 7. Silvapulle MJ. On the Existence of Maximum Likelihood Estimators for the Binomial Response Models. Journal of the Royal Statistical Society: Series B (Methodological). 1981;43(3):310–313.
- 8. Sun Y. Regularization in High-dimensional Statistics. PhD dissertation, Stanford University; 2015.
- 9. Tibshirani R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological). 1996;58(1):267–288.
- 10. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96(456):1348–1360. doi:10.1198/016214501753382273
- 11. Zhang C. Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics. 2010;38(2):894–942. doi:10.1214/09-AOS729
- 12. Park MY, Hastie T. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2007;69(4):659–677. doi:10.1111/j.1467-9868.2007.00607.x
- 13. Koh K, Kim S, Boyd SP. An Interior-Point Method for Large-Scale l1-Regularized Logistic Regression. Journal of Machine Learning Research. 2007;8:1519–1555.
- 14. Meier L, van de Geer S, Bühlmann P. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2008;70(1):53–71. doi:10.1111/j.1467-9868.2007.00627.x
- 15. Liang Y, Liu C, Luan X, Leung K, Chan T, Xu Z, et al. Sparse logistic regression with a L1/2 penalty for gene selection in cancer classification. BMC Bioinformatics. 2013;14:198. doi:10.1186/1471-2105-14-198
- 16. Polzehl J, Spokoiny V. Propagation-Separation Approach for Local Likelihood Estimation. Probability Theory and Related Fields. 2006;135(3):335–362. doi:10.1007/s00440-005-0464-1
- 17. Zhu H, Fan J, Kong L. Spatially Varying Coefficient Model for Neuroimaging Data With Jump Discontinuities. Journal of the American Statistical Association. 2014;109(507):1084–1098. doi:10.1080/01621459.2014.881742
- 18. Guo R, Ahn M, Zhu H. Spatially Weighted Principal Component Analysis for Imaging Classification. Journal of Computational and Graphical Statistics. 2015;24(1):274–296. doi:10.1080/10618600.2014.912135
- 19. Polzehl J, Spokoiny V. Adaptive Weights Smoothing with Applications to Image Restoration. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2000;62(2):335–354. doi:10.1111/1467-9868.00235
- 20. Li Y, Zhu H, Shen D, Lin W, Gilmore JH, Ibrahim JG. Multiscale adaptive regression models for neuroimaging data. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2011;73(4):559–578. doi:10.1111/j.1467-9868.2010.00767.x
- 21. Arias-Castro E, Salmon J, Willett R. Oracle inequalities and minimax rates for non-local means and related adaptive kernel-based methods. SIAM Journal on Imaging Sciences. 2012;5(3):944–992. doi:10.1137/110859403
- 22. Wang X, Zhu H, for the Alzheimer’s Disease Neuroimaging Initiative. Generalized Scalar-on-Image Regression Models via Total Variation. Journal of the American Statistical Association. 2017;112(519):1156–1168. doi:10.1080/01621459.2016.1194846
- 23. Hu W, Shen W, Zhou H, Kong D. Matrix Linear Discriminant Analysis. Technometrics. 2019;0(0):1–10.
- 24. Peyré G. Denoising by Sobolev and Total Variation Regularization. 2019. Available from: http://www.numerical-tours.com/matl#ab/denoisingsimp_4_denoiseregul/.
- 25. Qiu P. Image Processing and Jump Regression Analysis. Wiley-Interscience; 2005.
- 26. Zou H, Hastie T. Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005;67(2):301–320. doi:10.1111/j.1467-9868.2005.00503.x
- 27. Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K. Sparsity and Smoothness via the Fused Lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005;67(1):91–108. doi:10.1111/j.1467-9868.2005.00490.x
- 28. Hastie T, Tibshirani R, Wainwright M. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman & Hall/CRC; 2015.
- 29. Tseng P. Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization. Journal of Optimization Theory and Applications. 2001;109(3):475–494. doi:10.1023/A:1017501703105
- 30. Bunea F. Honest variable selection in linear and logistic regression models via l1 and l1 + l2 penalization. Electronic Journal of Statistics. 2008;2:1153–1194. doi:10.1214/08-EJS287
- 31. LeCun Y, Boser BE, Denker JS, Henderson D, Howard RE, Hubbard W, et al. Handwritten Digit Recognition with a Back-Propagation Network. Advances in Neural Information Processing Systems. 1989. p. 396–404.
- 32. Nesterov Y. Gradient methods for minimizing composite functions. Mathematical Programming. 2013;140(1):125–161. doi:10.1007/s10107-012-0629-5
- 33. Devroye L, Lugosi G. Combinatorial Methods in Density Estimation. Springer; 2001.
- 34. Wegkamp M. Lasso type classifiers with a reject option. Electronic Journal of Statistics. 2007;1(3):155–168. doi:10.1214/07-EJS058
- 35. Steinwart I. How to Compare Different Loss Functions and Their Risks. Constructive Approximation. 2007;26(2):225–287. doi:10.1007/s00365-006-0662-3