Abstract
The use of image covariates to build classification models has a broad impact in fields such as computer science and medicine. The aim of this paper is to develop an estimation method for the logistic regression model with image covariates. We propose a novel regularized estimation approach, in which the regularization combines an L1 penalty with a Sobolev norm penalty. The L1 penalty performs variable selection, while the Sobolev norm penalty captures the shape and edge information of the image data. We develop an efficient algorithm for the resulting optimization problem and establish a nonasymptotic error bound for the parameter estimate. Simulation studies and a real data application demonstrate that the proposed method performs well.
Introduction
As one of the most important problems in machine learning, classification plays a prominent role throughout various disciplines. A large number of classification methods have been developed, such as k-nearest neighbors (KNN), linear (quadratic) discriminant analysis, logistic regression, naive Bayes, decision trees, support vector machines (SVM), neural networks, deep learning, and many others [1, 2]. Among these methods, logistic regression has a long history [3] and remains one of the most popular approaches. The logistic regression model is a typical representative of generalized linear models and of linear classification methods, and it is therefore the focus of this article. Traditionally, the maximum likelihood method is used to estimate the parameter of the logistic regression model [4–6].
However, the big data era brings massive and complex data, one of whose most prominent characteristics is high dimensionality. In high dimensional settings, the maximum likelihood estimator in the logistic regression model faces serious problems such as non-existence and non-uniqueness [7]. Regularization is a popular strategy for handling high dimensional problems [8], and many regularized methods have been proposed over the past decades, including the LASSO [9], the smoothly clipped absolute deviation (SCAD) penalty [10], the minimax concave penalty (MCP) [11], and so on. For high dimensional logistic regression, [12] considered an L1-regularization path algorithm, [13] proposed an interior-point method for large-scale L1-regularized logistic regression, [14] proposed the group lasso for logistic regression, and [15] considered L1/2 regularized logistic regression for gene selection in cancer classification.
Image data are generated in many fields, such as computer science and medicine. In addition to being high dimensional, image data usually contain spatially smooth regions with relatively sharp edges, which gives rise to characteristics such as local smoothness [16] and jump discontinuities [17]. Local smoothness leads to highly correlated features, which makes image classification more challenging [18], while jump discontinuities make conventional smoothing techniques inefficient [17]. On the other hand, exploiting these characteristics in the modeling process often improves model efficiency, and this has received a lot of attention recently. For example, [19] introduced a locally adaptive smoothing method for image restoration. [16] proposed the propagation-separation (PS) approach for local likelihood estimation, which can handle the local smoothness of image data. [20] developed an adaptive regression model for the analysis of neuroimaging data, which generalizes the PS approach. [21] studied the theoretical performance of nonlocal means for noise removal in image data. [17] considered a spatially varying coefficient model for neuroimaging data with jump discontinuities. [18] proposed a spatially weighted principal component analysis (SWPCA) for imaging classification. [22] developed generalized scalar-on-image regression models via total variation regularization, which preserve the piecewise smooth nature of imaging data. [23] proposed an efficient nuclear norm penalized estimation method for matrix linear discriminant analysis.
In this paper, we consider a logistic regression model with image covariates and develop a regularized estimation approach that combines L1 regularization with Sobolev norm regularization. The L1 penalty performs variable selection and removes covariates unrelated to the response from the model [9]. The Sobolev norm penalty preserves characteristics of image data (e.g. local smoothness) in model fitting; indeed, Sobolev regularization is a popular technique in image data analysis, for example in image denoising [24] and edge detection [25]. The proposed regularization differs from the aforementioned regularized logistic regression models. It also differs from the elastic net [26], which combines the lasso and ridge penalties. The elastic net encourages a grouping effect, whereby strongly correlated predictors tend to enter or leave the model together, but it cannot exploit the structural information of image covariates and is therefore not well suited to models with image covariates. Our proposal also differs from the fused lasso [27]. In many real data analyses, such as gene expression data, the covariates have a natural order, and adjacent covariates are often highly correlated with similar effects on the response; the fused lasso tends to make adjacent covariates share a common effect. The proposed method can be viewed as an extension of the fused lasso from one dimension to multiple dimensions, with the fusion term replaced by a Sobolev norm penalty. Furthermore, we develop an algorithm to solve the resulting optimization problem, study the theoretical properties of the estimator, and give a nonasymptotic estimation error bound. Numerical studies, including simulations and a real data analysis, verify the performance of the method.
The rest of the article is organized as follows. Section 2 presents the methodology, including the model setup, the algorithm, and theoretical properties. Section 3 presents numerical studies, including simulations and a real data application. Section 4 gives a short conclusion. The proofs of the theoretical results are given in the Appendix.
Methodology
Model setup
Suppose that we have observations (Xi, Yi) with 1 ≤ i ≤ n, where Yi ∈ {−1, +1} is the class label and Xi ∈ ℝ^{p×q} is the corresponding image covariate. We further assume that the (Xi, Yi), 1 ≤ i ≤ n, are independent and identically distributed. In order to predict Yi from Xi, the following logistic regression model is assumed:
$$\log\frac{P_i}{1 - P_i} = \langle X_i, B\rangle, \qquad (1)$$
where Pi = P(Yi = +1|Xi), B ∈ ℝ^{p×q} is the corresponding coefficient image, and ⟨⋅,⋅⟩ denotes the inner product of two matrices, i.e., the sum of their elementwise products. Let β = vec(B) = (β1,⋯,βpq)^T and, with a slight abuse of notation, Xi = vec(Xi) = (xij, j = 1,⋯,pq)^T for i = 1, ⋯, n; then ⟨Xi, B⟩ = β^T Xi, and model (1) is equivalent to log(Pi/(1 − Pi)) = β^T Xi. The true value of β is denoted by β*.
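As a quick numerical check of the identity ⟨Xi, B⟩ = β^T Xi, the following minimal Python sketch (not part of the original derivation) compares the matrix inner product with the dot product of the vectorizations. Column-major vectorization is assumed here; any fixed ordering works as long as it is applied to both matrices.

```python
import numpy as np

p, q = 4, 3
X = np.random.randn(p, q)   # image covariate X_i
B = np.random.randn(p, q)   # coefficient image B

# <X, B>: elementwise (Frobenius) inner product of the two matrices
inner = np.sum(X * B)

# vectorized form beta^T x, using column-major (Fortran-order) vectorization
beta = B.flatten(order="F")
x = X.flatten(order="F")
assert np.isclose(inner, beta @ x)
```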
Traditionally, the maximum likelihood method is used to estimate the coefficient image B. The likelihood function is
$$L(\beta) = \prod_{i=1}^{n} \frac{1}{1 + \exp(-Y_i \beta^T X_i)},$$
and the corresponding log-likelihood function is ℓ(β) = −∑_{i=1}^n ln{1 + exp(−Yi β^T Xi)}. Denote the logistic loss function by φ(u) = ln(1 + e^{−u}); the associated risk is R(β) = E[φ(Y β^T X)], and we assume that β* minimizes R(β). The empirical risk is
$$R_n(\beta) = \frac{1}{n}\sum_{i=1}^{n}\varphi(Y_i \beta^T X_i) = \frac{1}{n}\sum_{i=1}^{n}\ln\{1 + \exp(-Y_i \beta^T X_i)\}.$$
Hence, maximizing the likelihood function is equivalent to minimizing the empirical risk R_n(β).
Many optimization methods, such as the Newton-Raphson method [6], can be used to solve this problem.
However, in the image covariate case, the coefficient image B is usually assumed to be a piecewise smooth image with unknown edges. This assumption is widely used in the imaging literature and is critical for addressing various scientific questions [22]. The maximum likelihood method does not take advantage of these characteristics. Moreover, image covariates are usually high dimensional, and not every element of Xi is useful for predicting Yi, yet the maximum likelihood method cannot perform variable selection. Consequently, in the next subsection we propose a novel estimation method for B that preserves characteristics of the image covariate, such as local smoothness, and performs variable selection simultaneously.
Estimation
For the coefficient image B = (b_{jk}), we define its discrete gradient ∇B by
$$(\nabla B)_{jk} = \big(b_{j+1,k} - b_{j,k},\; b_{j,k+1} - b_{j,k}\big),$$
the discrete gradient at position (j, k): b_{j+1,k} − b_{j,k} is the difference in the vertical direction and b_{j,k+1} − b_{j,k} is the difference in the horizontal direction. The Sobolev norm of B is the L2 norm of ∇B, which can be written as
$$\|\nabla B\|_2 = \Big\{\sum_{j,k}\big[(b_{j+1,k} - b_{j,k})^2 + (b_{j,k+1} - b_{j,k})^2\big]\Big\}^{1/2},$$
where the sum runs over all positions at which the corresponding differences are defined.
In fact, we can rewrite ‖∇B‖_2^2 as a quadratic form in β. Specifically, we define a matrix D = (d_{ij}), whose rows correspond to the first-order differences above, with d_{ij} defined in the following formula (2):
| (2) |
Then one can easily verify that ‖∇B‖_2^2 = β^T D^T D β. For illustration, the matrix D in the case p = q = 3 is shown in Fig 1.
Fig 1. The matrix D in the case p = q = 3.
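To illustrate how D acts, the following sketch builds a difference matrix for a p × q image and verifies numerically that β^T D^T Dβ equals the squared L2 norm of the discrete gradient. The row ordering of D and the column-major vectorization are illustrative choices and may differ from the exact indexing in formula (2).

```python
import numpy as np

def gradient_matrix(p, q):
    """One row of D per finite difference; column-major (Fortran-order)
    vectorization of B is assumed, i.e. beta = B.flatten(order="F")."""
    rows = []
    idx = lambda j, k: j + k * p              # position of b[j, k] in vec(B)
    for k in range(q):                        # vertical differences b[j+1,k] - b[j,k]
        for j in range(p - 1):
            r = np.zeros(p * q)
            r[idx(j + 1, k)], r[idx(j, k)] = 1.0, -1.0
            rows.append(r)
    for k in range(q - 1):                    # horizontal differences b[j,k+1] - b[j,k]
        for j in range(p):
            r = np.zeros(p * q)
            r[idx(j, k + 1)], r[idx(j, k)] = 1.0, -1.0
            rows.append(r)
    return np.vstack(rows)

# Sanity check: beta^T D^T D beta equals the squared L2 norm of the discrete gradient.
p, q = 3, 3
B = np.random.randn(p, q)
beta = B.flatten(order="F")
D = gradient_matrix(p, q)
sobolev_sq = beta @ D.T @ D @ beta
direct = np.sum(np.diff(B, axis=0) ** 2) + np.sum(np.diff(B, axis=1) ** 2)
assert np.isclose(sobolev_sq, direct)
```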
We then consider the following optimization problem
$$\min_{\beta}\; Q(\beta) := R_n(\beta) + \lambda_1\, \beta^T D^T D\, \beta + \lambda_2 \|\beta\|_1, \qquad (3)$$
where λ1 and λ2 are nonnegative tuning parameters and ‖⋅‖1 is the L1 norm. The term λ1 β^T D^T Dβ shrinks adjacent elements of B towards each other and hence captures the local smoothness of B. The term λ2‖β‖1 shrinks elements of B towards 0 and performs variable selection. We next propose an algorithm to solve the optimization problem (3).
Algorithm
For the optimization problem (3), define K = D^T D = (k_{jl}) and H(β) = R_n(β) + λ1 β^T Kβ, so that Q(β) = H(β) + λ2‖β‖1. Thus Q(β) is a convex function with a separable structure [28], and [29] shows that the coordinate descent algorithm is guaranteed to converge to the global minimizer for any convex objective with this separable structure. Hence we propose a coordinate descent algorithm to solve the optimization problem (3).
For j = 1, ⋯, pq, we successively minimize Q(β) along the βj direction with the other parameters fixed. Specifically, denote the current value of β by β^c, and let pi = exp(Xi^T β^c)/{1 + exp(Xi^T β^c)} for i = 1, ⋯, n be the current fitted probabilities. For each j, we approximate H(β), viewed as a function of βj with the other coordinates fixed at their current values, by a second order Taylor expansion. This yields a quadratic approximation of the form
$$\frac{a_j}{2}\,\beta_j^2 + c_j\,\beta_j + C,$$
where aj and cj collect the second and first order coefficients of the expansion (including the contribution of the penalty term λ1 β^T Kβ), and C is a constant containing no information about βj. One can then update βj by minimizing
$$\frac{a_j}{2}\,\beta_j^2 + c_j\,\beta_j + \lambda_2 |\beta_j|.$$
Setting the subderivative with respect to βj equal to zero, where the subderivative of |βj| is sign(βj) if βj ≠ 0 and any value in [−1, 1] otherwise, we obtain the soft-thresholding update
$$\beta_j \leftarrow \operatorname{sign}(-c_j)\,\frac{(|c_j| - \lambda_2)_+}{a_j},$$
where (x)_+ = max{x, 0}.
We summarize the algorithm as follows.
Coordinate descent algorithm

Step 1. Initialization: choose an initial value of β.

Step 2. For t = 1, 2, ⋯, update β:

For j = 1, ⋯, pq:
- Compute pi for 1 ≤ i ≤ n;
- Compute aj and cj;
- Update βj ← sign(−cj)(|cj| − λ2)_+/aj;

End for.

Step 3. Repeat Step 2 until convergence.
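The update above can be made concrete with the following Python sketch, which uses a Newton-type quadratic approximation of the logistic loss along each coordinate followed by soft-thresholding. The function names and the exact per-coordinate quantities (here the gradient and curvature of the smooth part of Q) are illustrative and may differ from the authors' implementation.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lg_sob_coordinate_descent(X, y, K, lam1, lam2, n_iter=100, tol=1e-6):
    """Coordinate descent for
       Q(beta) = (1/n) sum_i log(1 + exp(-y_i x_i^T beta))
                 + lam1 * beta^T K beta + lam2 * ||beta||_1,
    with X an (n, d) matrix of vectorized images, y in {-1, +1}, K = D^T D."""
    n, d = X.shape
    beta = np.zeros(d)
    eta = X @ beta                                # current linear predictors x_i^T beta
    for _ in range(n_iter):
        beta_old = beta.copy()
        for j in range(d):
            prob = 1.0 / (1.0 + np.exp(-eta))     # current fitted P(Y = +1 | x)
            w = prob * (1.0 - prob)               # logistic curvature weights
            # gradient and curvature of the smooth part of Q along beta_j
            grad = np.mean(X[:, j] * (prob - (y + 1) / 2)) + 2.0 * lam1 * (K[j] @ beta)
            curv = np.mean(w * X[:, j] ** 2) + 2.0 * lam1 * K[j, j] + 1e-12
            z = beta[j] - grad / curv
            b_new = soft_threshold(z, lam2 / curv)
            eta += X[:, j] * (b_new - beta[j])    # keep the predictors up to date
            beta[j] = b_new
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta
```

In practice one would vectorize the inner loop or refresh the weights only once per sweep, as in standard penalized GLM solvers.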
By the proposed algorithm, we obtain the solution of (3), which is denoted by β̂. The theoretical properties of β̂ as an estimator of β are studied in the next subsection.
Theoretical properties
In this subsection, we study the properties of β̂ and give a nonasymptotic error bound for it. We assume that the true value β* is sparse. Let I* = {j : β*_j ≠ 0}, and let k* = |I*| be the cardinality of I*. For the purpose of the theoretical studies, we make the following assumptions.
Assumption 1. Assume that there exists a constant L such that |xij| ≤ L for every 1 ≤ i ≤ n, 1 ≤ j ≤ pq.
Assumption 2. Assume that there exists a constant C such that ‖β*‖1 ≤ C.
Assumption 3. For the matrix K, assume that there exists a constant C0 such that λmax(K) ≤ C0, where λmax(K) is the largest eigenvalue of K.
Assumption 4. Let . Define the set Assume that there exists a constant 0 < b ≤ 1 such that for every β ∈ Vα, ϵ,
Assumption 1 imposes a common bound L on all xij with i = 1, ⋯, n, j = 1, ⋯, pq. Assumption 2 bounds ‖β*‖1. Together, Assumptions 1 and 2 ensure that the Pi, 1 ≤ i ≤ n, are bounded away from zero and one. If some Pi equals zero or one, the i-th subject is either ignorable or dominant in the likelihood function, which is undesirable in statistical analysis; this case is ruled out by Assumptions 1 and 2. Assumption 3 requires that the largest eigenvalue of K be bounded. Assumption 4 is the so-called Condition Stabil, which can be regarded as a stability requirement on the correlation structure [30]. Under these assumptions, we have the following theorem.
Theorem 1. Suppose that Assumptions 1–3 hold and that Assumption 4 holds with d = max{pq, n}, and let λ1 = λ2/(6CC0). If λ2 satisfies the condition

| (4) |

then, with high probability,
$$\|\hat{\beta} - \beta^*\|_1 \le \frac{3k^*\lambda_2}{sb} + \Big(1 + \frac{3s}{\lambda_2}\Big)\epsilon,$$
where s = (1 + e^A)^{−4} with A = 8CL is a constant.
The proof of Theorem 1 is given in the Appendix. The theorem shows that, with high probability, the L1 norm of the estimation error is bounded by 3k*λ2/(sb) + (1 + 3s/λ2)ϵ. The term (1 + 3s/λ2)ϵ is of order O(d/2^d) and is therefore negligible for large d. Hence the term 3k*λ2/(sb) dominates the upper bound, and it becomes larger as b becomes smaller. If we further assume that ln(pq) = o(n), condition (4) allows λ2 to tend to 0, so that 3k*λ2/(sb) → 0 and the upper bound tends to zero. Consequently, the consistency of β̂ is guaranteed.
The selection of tuning parameters
The objective function (3) contains two tuning parameters, λ1 and λ2, which should be determined by some criterion such as BIC or cross validation. In our simulation studies, we select the tuning parameters using a validation set, and in the real data analysis the cross validation method is used. Before applying these methods, one should first determine the range of the tuning parameters. Specifically, we reparameterize λ1 and λ2: let λ = λ1 + λ2 and α = λ2/λ. Then the penalty terms in (3) can be rewritten as λ(α‖β‖1 + (1 − α)β^T Kβ). Because α ∈ [0, 1], the candidate values of α are set to 0.02κ for κ = 1, ⋯, 50. For a given α, we denote by λ0 the threshold value such that whenever λ ≥ λ0, the solution of (3) is exactly zero. By the subgradient (Karush-Kuhn-Tucker) condition of (3) at β = 0, if β = 0 is the solution, then every element of ∇R_n(0)/(λα) belongs to [−1, 1]. This means that λ0 = ‖∇R_n(0)‖_∞/α. Following the idea of [14], the candidate values of λ are set to 0.001 and 0.96^ν λ0 for ν = 0, 1, ⋯, 160. For the validation set method, the prediction error on the validation set obtained with tuning parameters (α, λ) is denoted by PE(α, λ), and the final (α, λ) are selected as the minimizer of PE(α, λ).
For the M-fold cross validation method, the data are randomly divided into M folds of approximately equal size. For m = 1, ⋯, M, we treat the m-th fold as the validation set and fit the model with tuning parameters (α, λ) on the remaining M − 1 folds. The corresponding prediction error on the held-out fold is denoted by PE_m(α, λ), and the cross validation prediction error is defined as
$$CV(\alpha, \lambda) = \frac{1}{M}\sum_{m=1}^{M} PE_m(\alpha, \lambda).$$
The (α, λ) are selected as the minimizer of the cross validation prediction error [6].
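The following sketch illustrates the validation-set search over the (α, λ) grid described above. It reuses the lg_sob_coordinate_descent sketch given earlier; the helper name select_tuning and the use of the misclassification rate as PE(α, λ) are assumptions made for illustration.

```python
import numpy as np

def select_tuning(X, y, K, X_val, y_val, fit):
    """Grid search over (alpha, lambda); `fit` is a solver such as the
    lg_sob_coordinate_descent sketch, called with lam1 and lam2."""
    n = X.shape[0]
    # lambda_0 for a given alpha: the smallest lambda making beta = 0 optimal,
    # i.e. the sup-norm of the gradient of R_n at 0 divided by alpha.
    grad0 = np.max(np.abs(X.T @ ((y + 1) / 2 - 0.5))) / n
    best, best_err = None, np.inf
    for alpha in (0.02 * k for k in range(1, 51)):
        lam0 = grad0 / alpha
        for lam in [0.001] + [0.96 ** v * lam0 for v in range(161)]:
            beta = fit(X, y, K, lam1=lam * (1 - alpha), lam2=lam * alpha)
            pred = np.where(X_val @ beta > 0, 1, -1)
            err = np.mean(pred != y_val)          # PE(alpha, lambda) on the validation set
            if err < best_err:
                best, best_err = (alpha, lam), err
    return best
```

In practice one would warm-start the solver along the decreasing λ path rather than refitting from scratch at every grid point.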
Numerical studies
In this section, we evaluate the performance of the proposed method through two simulated examples and a real data analysis. For comparison, we also consider the logistic regression model with the L1 penalty [12, 13], the logistic regression model with the fused lasso penalty, and the linear support vector machine, denoted by LG-L1, LG-fused, and Linear SVM, respectively. Our proposed method is abbreviated as LG-sob.
Simulation studies
Example 1. We generate data from the following model:
$$\log\frac{P(Y_i = +1 \mid X_i)}{1 - P(Y_i = +1 \mid X_i)} = \langle X_i, B_0\rangle,$$
where Xi and B0 both belong to ℝ^{32×32}. One consequence of using image covariates is that the corresponding regression coefficient can also be treated as an image. Hence we specify B0 directly as an image, while Xi is generated from a multivariate normal distribution. Specifically, we define the vectorization of Xi as Xi, and Xi is generated from a multivariate normal distribution with mean 0 and a prescribed covariance between the j1-th and j2-th entries for any 1 ≤ j1, j2 ≤ 1024. The parameter image B0 is considered in two cases, shown in Fig 2. The first case, denoted by B01, is a bird picture in which the blue region takes the value 0 and the yellow region takes the value 1. The second case, denoted by B02, is a butterfly picture, which is more complicated and takes values in the interval [−0.0197, 0.0628]. Given Xi and B0, the response Yi is generated from a two-point distribution with P(Yi = +1|Xi) given by the model above.
Fig 2. Simulated example.
The true parameter images B0.
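To make the data generating mechanism concrete, the following sketch simulates one training set in the spirit of Example 1. The exponentially decaying covariance and the simple rectangular coefficient image are illustrative assumptions standing in for the covariance structure and the bird/butterfly images used in the paper, which are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
p = q = 32
d = p * q
n = 500

# Illustrative coefficient image: a rectangular "blob" standing in for B01.
B0 = np.zeros((p, q))
B0[10:22, 12:24] = 1.0
beta0 = B0.flatten(order="F")

# Illustrative covariance: exponential decay in the pixel-index distance (assumption).
idx = np.arange(d)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)   # rows are vectorized images

# Responses from the logistic model: P(Y = +1 | X) = 1 / (1 + exp(-<X, B0>)).
prob = 1.0 / (1.0 + np.exp(-X @ beta0))
Y = np.where(rng.uniform(size=n) < prob, 1, -1)
```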
Example 2. The data generating mechanism is similar to that of Example 1; the only difference is that Xi is generated in a more complex way. In particular, we follow the simulation scheme of [22] and generate Xi from a 32 × 32 phantom map with 1024 pixels according to the spatially correlated random process Xi = ∑_l l^{−1} η_{il} φ_l, where the η_{il} are standard normal random variables and the φ_l are bivariate Haar wavelet basis functions.
For these two simulated examples, in addition to the training set of sample size n, we generate a validation set and a test set, each with sample size 500. We train the model on the training set, select the tuning parameters on the validation set, and compute the classification accuracy on the test set to evaluate the performance of the model.
For every specification of the parameter B0 and sample size n, we replicate the simulation 100 times for each example; the average prediction errors are summarized in Table 1 for Example 1 and Table 2 for Example 2. Besides the prediction errors, we also report the average estimation errors of the parameter image estimates over the 100 replications for LG-sob, LG-L1, and LG-fused. From the results, one can see that our proposed method outperforms the other three methods in all cases in terms of prediction. As the sample size n becomes larger, the prediction errors become smaller, but the estimation errors do not decrease consistently. The reason may be that the tuning parameters are selected by minimizing the prediction error.
Table 1. Results of simulated example 1: Prediction error (PE) and estimation error (EE).
| Method | PE (500, B01) | EE (500, B01) | PE (1000, B01) | EE (1000, B01) | PE (500, B02) | EE (500, B02) | PE (1000, B02) | EE (1000, B02) |
|---|---|---|---|---|---|---|---|---|
| LG-sob | 0.099 | 337.645 | 0.075 | 336.173 | 0.107 | 8.153 | 0.080 | 13.505 |
| LG-L1 | 0.272 | 404.589 | 0.199 | 375.926 | 0.272 | 17.354 | 0.204 | 23.143 |
| LG-fused | 0.248 | 423.242 | 0.190 | 406.866 | 0.248 | 7.016 | 0.190 | 9.400 |
| Linear SVM | 0.221 | NA | 0.174 | NA | 0.223 | NA | 0.172 | NA |
Table 2. Results of simulated example 2: Prediction error (PE) and estimation error (EE).
| Method | PE (500, B01) | EE (500, B01) | PE (1000, B01) | EE (1000, B01) | PE (500, B02) | EE (500, B02) | PE (1000, B02) | EE (1000, B02) |
|---|---|---|---|---|---|---|---|---|
| LG-sob | 0.028 | 181.06 | 0.023 | 176.19 | 0.049 | 582.94 | 0.038 | 1096.1 |
| LG-L1 | 0.073 | 1712.6 | 0.058 | 1501.7 | 0.116 | 2190.9 | 0.090 | 3022.4 |
| LG-fused | 0.052 | 438.57 | 0.044 | 495.73 | 0.097 | 482.19 | 0.076 | 775.99 |
| Linear SVM | 0.050 | NA | 0.038 | NA | 0.084 | NA | 0.071 | NA |
Moreover, we randomly select one simulated result from the 100 replications of Example 1 and show the parameter image estimates in Fig 3. One can see that our proposed LG-sob method captures the shapes of the images, whereas LG-L1 and LG-fused do not.
Fig 3. Simulated example.
One randomly selected set of parameter image estimates. The first row shows the results of our proposed LG-sob, the second row the results of LG-L1, and the third row the results of LG-fused.
A real data analysis
Classification on the ZIP Code Dataset is a benchmark problem in the machine learning community [6]. The dataset can be obtained from the following website: https://web.stanford.edu/~hastie/StatLearnSparsity_files/DATA/zipcode.html [28]. It contains normalized handwritten digits, automatically scanned from envelopes by the U.S. Postal Service. Each observation is a handwritten digit, represented as a size-normalized 16 × 16 grayscale image [31], and the task is to use the 256 pixel values to predict the corresponding digit. The dataset consists of a training set with 7291 observations and a test set with 2007 observations. Because this article only considers binary response prediction with logistic regression models, and the digits 3 and 8 share similar characteristics, we only consider handwritten 3’s and 8’s. The numbers of handwritten 3’s and 8’s are 658 and 542, respectively, in the training set, and both are 166 in the test set. Fig 4 shows some examples of handwritten 3’s and 8’s.
Fig 4. Real data analysis.
Some examples of handwritten 3’s and 8’s.
More specifically, we denote the i-th observation by Xi ∈ ℝ^{16×16} and define the corresponding class label as Yi = −1 if Xi represents a handwritten 3 and Yi = +1 if Xi represents a handwritten 8. Our proposed method is applied to construct a classifier that predicts Yi (i.e., the handwritten numeral) from the grayscale image Xi. We train the model on the training set and evaluate the performance on the test set by classification accuracy. For comparison, we also apply LG-L1, LG-fused, and Linear SVM.
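A minimal preprocessing sketch for this experiment is given below. The file names and layout (whitespace-separated files zip.train and zip.test, with the digit label in the first column and the 256 pixel values in the remaining columns) are assumptions about the downloaded data and should be checked against the actual files.

```python
import numpy as np

def load_threes_and_eights(path):
    # Assumed layout: first column = digit label, remaining 256 columns = pixel values.
    data = np.loadtxt(path)
    labels, pixels = data[:, 0].astype(int), data[:, 1:]
    keep = np.isin(labels, (3, 8))
    X = pixels[keep]                        # each row is a vectorized 16 x 16 image
    y = np.where(labels[keep] == 8, 1, -1)  # 8 -> +1, 3 -> -1, as in the paper
    return X, y

X_train, y_train = load_threes_and_eights("zip.train")
X_test, y_test = load_threes_and_eights("zip.test")
```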
The tuning parameters are selected by the 10-fold cross validation (CV) method. The CV prediction errors for the various parameter settings are plotted in Fig 5. Our proposed method selects the tuning parameters α = 0.04 and λ = 0.0118, while the method with the L1 penalty selects λ = 0.0014. The parameter image estimates are shown in Fig 6. One can see that our proposed method tends to make adjacent pixels have similar effects in the model, LG-L1 yields a sparser parameter estimate, and LG-fused tends to make only vertically adjacent pixels share similar effects. The top-left region of the parameter image has a positive effect towards handwritten 8, and the bottom-right region has a positive effect towards handwritten 3. The classification accuracy on the test set is 96.99% for our proposed method, while the accuracies of LG-L1, LG-fused, and Linear SVM are 96.39%, 96.08%, and 96.39%, respectively. The proposed method performs best.
Fig 5. Real data analysis.
The results of 10-fold CV: prediction error for various parameter settings.
Fig 6. Real data analysis.
The parameter image estimates. Left: our proposed LG-sob; Middle: LG-L1; Right: LG-fused.
Conclusion
We have developed a novel estimation method for logistic regression with image covariates. The method not only performs variable selection but also captures the shape features of the parameter image. Both theoretical results and numerical studies show that the method performs well. We have proposed a coordinate descent algorithm to solve the optimization problem, and the global convergence of the algorithm is guaranteed. However, as pointed out by one referee, the coordinate descent algorithm is time consuming, especially for high dimensional image covariates. More efficient optimization approaches, such as Nesterov accelerated gradient methods [32] and interior-point methods [13], may be more suitable; we will investigate this issue in future work. Furthermore, our method is based on Sobolev norm regularization, whereas total variation regularization is better suited to capturing sharp edges and jumps in the parameter image; however, the corresponding estimation algorithm is more complex, and we leave it for future work.
Appendix: Proof of Theorem 1
Before giving the proof of Theorem 1, we first list the bounded differences inequality as the following lemma without proof.
Lemma 1 (the bounded differences inequality) Suppose that Z1, ⋯, Zn are independent random variables and that the function f satisfies the bounded differences assumption
$$\sup_{z_1,\dots,z_n,\,z_i'} \big|f(z_1,\dots,z_i,\dots,z_n) - f(z_1,\dots,z_i',\dots,z_n)\big| \le c_i$$
for i = 1, ⋯, n. Then for all t > 0,
$$P\big(f(Z_1,\dots,Z_n) - E f(Z_1,\dots,Z_n) \ge t\big) \le \exp\Big(-\frac{2t^2}{\sum_{i=1}^{n} c_i^2}\Big).$$
For more details about Lemma 1 and its proof, one can refer to [33]. The following is the proof of Theorem 1.
Proof of Theorem 1. By the definitions of and β*, one can see that
and
Hence, we have that
| (5) |
Moreover,
| (6) |
We first consider the term . Specifically, define . Let , and
which is the empirical measure corresponding to replacing (Yl, Xl) by . Then
where the inequality is obtained by a first order Taylor expansion and Assumption 1. Then by Lemma 1, we can obtain that
Let , then we have that , and P(Ln − E(Ln) ≥ u) ≤ δ.
Let d = max{pq, n}. Taking , by Lemma 3 of [34] with Cφ = 1, CF = L, we have
Consequently, we have that
By the condition of Theorem 1, we know , hence P(Ln ≤ λ2/3) ≥ 1 − δ.
On the event {Ln ≤ λ2/3}, we have that
| (7) |
Secondly, we consider the term . Based on Assumptions 2 and 3, one can see that
One can see that if λ1 = λ2/(6CC0), we have
| (8) |
Consequently, on the event {Ln ≤ λ2/3} we combine (6), (7), (8), and obtain
| (9) |
Hence we have that
| (10) |
By (9) one can also obtain that
Consequently, we have . This means that .
By Example 4.5 in [35], we have that , where P(β) = 1/(1 + e^{−X^T β}) and EX(⋅) is the expectation with respect to the distribution of X. Using a Taylor expansion, one can obtain that
where for some τ ∈ (0, 1). Moreover, by (10) and Assumptions 1-2, we have This means that
where s = (1 + eA)−4 and A = 8CL, then by Assumption 4 we have that
| (11) |
Furthermore, we have
where the first inequality follows by (11), the second inequality follows by (5), and the fourth inequality is obtained by combining the results of (7) and (8). Consequently, we have
where a is a positive constant and the second inequality follows by
Let a = λ2/(3sb), then
This completes the proof of the Theorem.
Acknowledgments
We thank the Editor, the Associate Editor, and four referees for their helpful comments and valuable suggestions, which have greatly improved the article.
Data Availability
The ZIP Code Dataset is available from the following website: https://web.stanford.edu/~hastie/StatLearnSparsity_files/DATA/zipcode.html.
Funding Statement
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Bishop CM. Pattern Recognition and Machine Learning. Springer; 2007.
- 2. Goodfellow I, Bengio Y, Courville A. Deep Learning. The MIT Press; 2016.
- 3. Cox DR. The Regression Analysis of Binary Sequences. Journal of the Royal Statistical Society: Series B (Methodological). 1958;20(2):215–242. doi:10.1111/j.2517-6161.1958.tb00292.x
- 4. Albert A, Anderson JA. On the Existence of Maximum Likelihood Estimates in Logistic Regression Models. Biometrika. 1984;71(1):1–10. doi:10.1093/biomet/71.1.1
- 5. Santner TJ, Duffy DE. A Note on A. Albert and J. A. Anderson’s Conditions for the Existence of Maximum Likelihood Estimates in Logistic Regression Models. Biometrika. 1986;73(3):755–758. doi:10.1093/biomet/73.3.755
- 6. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. Springer; 2009. Available from: http://www-stat.stanford.edu/~tibs/ElemStatLearn/.
- 7. Silvapulle MJ. On the Existence of Maximum Likelihood Estimators for the Binomial Response Models. Journal of the Royal Statistical Society: Series B (Methodological). 1981;43(3):310–313.
- 8. Sun Y. Regularization in High-dimensional Statistics. PhD dissertation, Stanford University; 2015.
- 9. Tibshirani R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological). 1996;58(1):267–288.
- 10. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96(456):1348–1360. doi:10.1198/016214501753382273
- 11. Zhang C. Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics. 2010;38(2):894–942. doi:10.1214/09-AOS729
- 12. Park MY, Hastie T. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2007;69(4):659–677. doi:10.1111/j.1467-9868.2007.00607.x
- 13. Koh K, Kim S, Boyd SP. An Interior-Point Method for Large-Scale l1-Regularized Logistic Regression. Journal of Machine Learning Research. 2007;8:1519–1555.
- 14. Meier L, van de Geer S, Bühlmann P. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2008;70(1):53–71. doi:10.1111/j.1467-9868.2007.00627.x
- 15. Liang Y, Liu C, Luan X, Leung K, Chan T, Xu Z, et al. Sparse logistic regression with a L1/2 penalty for gene selection in cancer classification. BMC Bioinformatics. 2013;14:198. doi:10.1186/1471-2105-14-198
- 16. Polzehl J, Spokoiny V. Propagation-Separation Approach for Local Likelihood Estimation. Probability Theory and Related Fields. 2006;135(3):335–362. doi:10.1007/s00440-005-0464-1
- 17. Zhu H, Fan J, Kong L. Spatially Varying Coefficient Model for Neuroimaging Data With Jump Discontinuities. Journal of the American Statistical Association. 2014;109(507):1084–1098. doi:10.1080/01621459.2014.881742
- 18. Guo R, Ahn M, Zhu H. Spatially Weighted Principal Component Analysis for Imaging Classification. Journal of Computational and Graphical Statistics. 2015;24(1):274–296. doi:10.1080/10618600.2014.912135
- 19. Polzehl J, Spokoiny V. Adaptive Weights Smoothing with Applications to Image Restoration. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2000;62(2):335–354. doi:10.1111/1467-9868.00235
- 20. Li Y, Zhu H, Shen D, Lin W, Gilmore JH, Ibrahim JG. Multiscale adaptive regression models for neuroimaging data. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2011;73(4):559–578. doi:10.1111/j.1467-9868.2010.00767.x
- 21. Arias-Castro E, Salmon J, Willett R. Oracle inequalities and minimax rates for non-local means and related adaptive kernel-based methods. SIAM Journal on Imaging Sciences. 2012;5(3):944–992. doi:10.1137/110859403
- 22. Wang X, Zhu H, for the Alzheimer’s Disease Neuroimaging Initiative. Generalized Scalar-on-Image Regression Models via Total Variation. Journal of the American Statistical Association. 2017;112(519):1156–1168. doi:10.1080/01621459.2016.1194846
- 23. Hu W, Shen W, Zhou H, Kong D. Matrix Linear Discriminant Analysis. Technometrics. 2019;0(0):1–10.
- 24. Peyré G. Denoising by Sobolev and Total Variation Regularization. 2019. Available from: http://www.numerical-tours.com/matl#ab/denoisingsimp_4_denoiseregul/.
- 25. Qiu P. Image Processing and Jump Regression Analysis. Wiley-Interscience; 2005.
- 26. Zou H, Hastie T. Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005;67(2):301–320. doi:10.1111/j.1467-9868.2005.00503.x
- 27. Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K. Sparsity and Smoothness via the Fused Lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005;67(1):91–108. doi:10.1111/j.1467-9868.2005.00490.x
- 28. Hastie T, Tibshirani R, Wainwright M. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman & Hall/CRC; 2015.
- 29. Tseng P. Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization. Journal of Optimization Theory and Applications. 2001;109(3):475–494. doi:10.1023/A:1017501703105
- 30. Bunea F. Honest variable selection in linear and logistic regression models via l1 and l1 + l2 penalization. Electronic Journal of Statistics. 2008;2:1153–1194. doi:10.1214/08-EJS287
- 31. LeCun Y, Boser BE, Denker JS, Henderson D, Howard RE, Hubbard W, et al. Handwritten Digit Recognition with a Back-Propagation Network. Advances in Neural Information Processing Systems. 1989. p. 396–404.
- 32. Nesterov Y. Gradient methods for minimizing composite functions. Mathematical Programming. 2013;140(1):125–161. doi:10.1007/s10107-012-0629-5
- 33. Devroye L, Lugosi G. Combinatorial Methods in Density Estimation. Springer; 2001.
- 34. Wegkamp M. Lasso type classifiers with a reject option. Electronic Journal of Statistics. 2007;1(3):155–168. doi:10.1214/07-EJS058
- 35. Steinwart I. How to Compare Different Loss Functions and Their Risks. Constructive Approximation. 2007;26(2):225–287. doi:10.1007/s00365-006-0662-3