Generalized Scalar-on-Image Regression Models via Total Variation

Xiao Wang; Hongtu Zhu

doi:10.1080/01621459.2016.1194846

Author manuscript; available in PMC: 2017 Nov 17.

Published in final edited form as: J Am Stat Assoc. 2017 Apr 13;112(519):1156–1168. doi: 10.1080/01621459.2016.1194846

Xiao Wang ¹, Hongtu Zhu ², for the Alzheimer’s Disease Neuroimaging Initiative

PMCID: PMC5693263 NIHMSID: NIHMS848317 PMID: 29151658

Abstract

The use of imaging markers to predict clinical outcomes can have a great impact on public health. The aim of this paper is to develop a class of generalized scalar-on-image regression models via total variation (GSIRM-TV), in the sense of generalized linear models, for a scalar response and an imaging predictor in the presence of scalar covariates. A key novelty of GSIRM-TV is the assumption that the slope function (or image) belongs to the space of functions of bounded total variation, which explicitly accounts for the piecewise smooth nature of most imaging data. We develop an efficient penalized total variation optimization to estimate the unknown slope function and other parameters. We also establish nonasymptotic error bounds on the excess risk. These bounds are explicitly specified in terms of sample size, image size, and image smoothness. Our simulations demonstrate the superior performance of GSIRM-TV against many existing approaches. We apply GSIRM-TV to the analysis of hippocampus data obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset.

Keywords: Excess risk, Functional regression, Generalized scalar-on-image regression, Prediction, Total variation

1 Introduction

The aim of this paper is to develop generalized scalar-on-image regression models via total variation (GSIRM-TV) with a scalar response and imaging and/or scalar predictors. This new development is motivated by studying the predictive value of ultra-high dimensional imaging data and/or other scalar predictors (e.g., cognitive score) for clinical outcomes, including diagnostic status and the response to treatment, in the study of neurodegenerative and neuropsychiatric diseases, such as Alzheimer’s disease (AD) (Mu and Gage 2011). For instance, the growing public threat of AD has raised the urgency to discover and validate prognostic biomarkers that may identify subjects at greatest risk for future cognitive decline and accelerate the testing of preventive strategies. In this regard, prior studies of subjects at risk for AD have examined the utility of various individual biomarkers, such as cognitive tests, fluid markers, imaging measurements, or some individual genetic markers (e.g., APOE4 gene), to capture the heterogeneity and multifactorial complexity of AD (reviewed in Weiner et al. 2012).

Our GSIRM-TV considers the use of an imaging predictor X and/or scalar predictors Z to predict a scalar response Y. In practice, imaging data are often represented as a 2-dimensional matrix or a 3-dimensional array. Assume that $X \in ℝ^{N × N}$ is a 2-dimensional matrix of size N × N, which is observed without error, and that $Z \in ℝ^{p}$ is a p × 1 vector whose first component is the constant one. Our GSIRM-TV assumes that Y given (X, Z) follows

Y ∣ (X, Z) ~ Exponential Family (μ, ϕ) and g (μ) = θ_{0}^{T} Z + 〈 X, β_{0} 〉,

(1)

where μ and ϕ are, respectively, canonical and scale parameters, $〈U, V〉 = \sum_{i, j} u_{i j} v_{i j}$ for $U = (u_{i j}) \in ℝ^{N × N}$ and $V = (v_{i j}) \in ℝ^{N × N}$, and g(·) is a known link function. Moreover, θ₀ and β₀(·) are unknown parameters of interest, and β₀(·) is called the coefficient image/function. Throughout the paper, we assume that images are observed without error. When measurement errors are present, they may be reduced by applying smoothing methods to the images (Li et al. 2010).

GSIRM-TV can be regarded as an extension of the well-known functional linear model (FLM) and the high-dimensional linear model (HLM) that have been extensively studied in the literature. If we regard 〈U, V〉 as an approximation of a two-dimensional integral, then GSIRM-TV is an approximated version of FLM. The literature on FLM is too vast to summarize here. Please see the well-known monographs Ramsay and Silverman (2005) and Ferraty and Vieu (2006). The functional principal component analysis (fPCA) and various penalization methods have been developed to estimate the coefficient function. For example, the fPCA method has been discussed by James (2002), Müller and Stadtmüller (2005), Hall and Horowitz (2007), Reiss and Ogden (2007, 2010), James et al. (2009), and Goldsmith et al. (2010) and the penalized method has been studied by Crambes et al. (2009), Yuan and Cai (2010), and Du and Wang (2014). On the other hand, if we vectorize X and β₀(·) as N² × 1 vectors, model (1) takes the form of the high dimensional generalized linear regression. To achieve sparsity in β₀, various penalization methods, such as Lasso or SCAD, have been developed. Please see Tibshirani (1996), Chen et al. (1998), Fan and Li (2001), and references therein.

Compared with FLM and HLM, a key novelty of GSIRM-TV is that the coefficient image β₀(·) in model (1) is assumed to be a piecewise smooth image with unknown jumps and edges. Such an assumption not only has been widely used in the imaging literature, but also is critical for addressing various scientific questions, such as the identification of brain regions associated with AD. As an illustration, we consider a data set with n = 300 subjects simulated from a functional linear model, which is a special case of (1). The first row of Figure 1 presents the true 64 × 64 coefficient image β₀, X, and Y from left to right. We vectorized X and used fPCA for FLM, Lasso for HLM, and GSIRM-TV to estimate the coefficient image; the estimated coefficient images are presented in the second row of Figure 1. Unfortunately, both FLM and HLM fail to capture the main features of the true coefficient image due to their key limitations. First, fPCA requires that β₀ be well represented by the eigenfunctions of X, which is not the case according to Figure 1. Second, the existing regularization methods can have difficulty in recovering β₀, since the true coefficient image is non-sparse. Moreover, most regularization methods for FLM assume that the unknown coefficient function is one-dimensional and belongs to a smooth function space, such as a Sobolev space, and thus they cannot preserve edge and boundary information for the data set presented in Figure 1. In contrast, the GSIRM-TV estimate developed in this paper preserves the sharp edges of the original image.

Figure 1. Results from a simulated data set. The top row includes the true 64 × 64 coefficient image β₀ in the left panel, one realization of a 64 × 64 image predictor X in the middle panel, and the responses Y from n = 300 subjects in the right panel. The bottom row includes the estimated coefficient functions obtained from fPCA (left), Lasso (middle), and Total Variation (right).

In this paper, we make two important contributions: a new estimation method based on total variation analysis, and nonasymptotic error bounds on the risk under the framework of GSIRM-TV. Total variation analysis has played a fundamental role in various image analyses since the path-breaking works of Rudin and Osher (1994) and Rudin, Osher and Fatemi (1992). The total variation penalty has been shown to be quite effective at preserving the boundaries and edges of images (Rudin et al. 1992). Michel et al. (2011) proposed a similar total variation method for image regression and image classification, but they focus on the development of different algorithms for the TV optimization problem. To the best of our knowledge, this is the first paper on the statistical analysis of the total variation method for GSIRM-TV. The fused lasso (Tibshirani et al. 2005; Friedman et al. 2007) uses a similar penalty function, but for a 2-dimensional parameter the fused lasso and the TV penalty can be quite different. For example, the isotropic total variation penalty uses the Euclidean norm of the first differences of the parameter, rather than the sum of the absolute values of the first differences. There are a few papers on the use of two-dimensional or three-dimensional imaging predictors in FLM (Guillas and Lai 2010; Reiss and Ogden 2010; Zhou et al. 2013; James et al. 2009; Goldsmith et al. 2010; Gertheiss et al. 2013; Wang et al. 2014; Reiss et al. 2015), but none of them consider piecewise smooth functions with jumps and edges or total variation analysis. We also derive nonasymptotic error bounds on the risk for the estimated coefficient image under the total variation penalty. We are able to obtain finite-sample bounds that are specified explicitly in terms of the sample size n, the image size N × N, and the image smoothness.

The rest of the paper is organized as follows. Section 2 considers the linear scalar-on-image regression model and proposes the TV optimization framework to estimate the unknown coefficient image. We also establish the nonasymptotic error bound for the prediction error. Section 3 extends the linear scalar-on-image regression model to generalized scalar-on-image regression models. Section 4 examines the finite-sample performance of GSIRM-TV and compares it with several state-of-the-art methods, such as regularized matrix regression (Zhou and Li 2014). Section 5 applies GSIRM-TV to hippocampus imaging data for a binary classification problem. Future research directions are discussed in Section 6. The technical proofs of the main theorems are given in the Appendix.

2 Linear scalar-on-image regression model

We start with considering a linear scalar-on-image regression model, which is the simplest case of GSIRM-TV (1), as follows:

Y = 〈 X, β_{0} 〉 + ε,

(2)

where ε is the random error with 𝔼(ε|X) = 0 and 𝔼(ε²|X) = σ², and without loss of generality, both X and Y are assumed to be centered with 𝔼(Y) = 𝔼(X) = 0. Model (2) may be treated as a special case of FLM since discrete images are isometric to the space of piecewise-constant functions defined as

\mathcal{X} = \left\{x \in L^{2} (Ω) : x (u, v) = N X_{j k}, \frac{j - 1}{N} \leq u < \frac{j}{N}, \frac{k - 1}{N} \leq v < \frac{k}{N} \text{ for } 1 \leq j, k \leq N\right\},

where X_jk is the (j, k)–th pixel value of the image X and Ω = [0, 1]². By treating β₀ as an integrable function in Ω, that is, β₀ ∈ L²(Ω), model (2) can be rewritten as

Y = \int_{0}^{1} \int_{0}^{1} x (u, v) β_{0} (u, v) dudv + ε .

2.1 The space of bounded variation

Throughout the paper, β₀ is assumed to be a function of bounded variation in Ω; that is, the total variation of β₀ in Ω, denoted by ||β₀||_TV, is finite, where

{‖ β_{0} ‖}_{TV} = \sup \left\{\int_{Ω} β_{0} (u, v) \, \mathrm{div} f (u, v) \, du \, dv : f \in C_{c}^{\infty} (Ω; R^{2}), {∣ f ∣}_{\infty} \leq 1\right\},

where $|f|_{\infty} = \sup_{(u, v) \in Ω} |f (u, v)|$ and $C_{c}^{\infty} (Ω; R^{2})$ denotes the space of infinitely differentiable vector fields with values in R² and compact support in Ω. Moreover, f(u, v) = (f₁(u, v), f₂(u, v)) and $\mathrm{div} f (u, v) = \partial_{u} f_{1} (u, v) + \partial_{v} f_{2} (u, v)$, where ∂_u = ∂/∂u and ∂_v = ∂/∂v. The vector space of functions of bounded variation in Ω is denoted by BV(Ω). For example, if β₀ is differentiable in Ω, then ||β₀||_TV reduces to $\int_{Ω} \sqrt{{(\partial_{u} β_{0})}^{2} + {(\partial_{v} β_{0})}^{2}} \, du \, dv$. In this case, β₀ belongs to the Sobolev space $W^{1,1} (Ω)$, i.e., the space of functions with integrable first-order partial derivatives. However, the power of total variation in image analysis arises exactly from the relaxation of such constraints. The space BV(Ω) is much larger than $W^{1,1} (Ω)$ and contains many interesting piecewise continuous functions with jumps and edges. This is exactly the advantage of using TV regularization over other familiar regularization methods used in the nonparametric literature. For example, the smoothing spline penalty is not sensitive enough to capture sharp edges and jumps.

There are at least two additional advantages of using bounded variation functions in model (2). First, many real images with edges have small total variation, since image edges usually reside in a low-dimensional subset of pixels. As an illustration, in Figure 2, the left panel displays the Shepp-Logan phantom image, while the middle and right panels show the two components of the discrete gradient of the phantom image, which are clearly sparse. Second, BV(Ω) is mathematically tractable even though it contains many more functions with edges and jumps than $W^{1,1} (Ω)$.

Figure 2. Left: the Shepp-Logan phantom image; Middle and Right: the two components of the discrete gradient of the phantom image.

2.2 Estimation

On the basis of model (2) and BV(Ω), we propose to solve the following TV minimization:

\begin{array}{ll} \text{minimize} & {‖ β ‖}_{TV} \\ \text{subject to} & \sum_{i = 1}^{n} {(Y_{i} - 〈 X_{i}, β 〉)}^{2} \leq λ^{2}, \end{array}

(3)

where λ is a smoothing parameter, which controls the noise level. It is known that the above minimization problem is equivalent to the penalized optimization

\sum_{i = 1}^{n} {(Y_{i} - 〈 X_{i}, β 〉)}^{2} + \check{λ} {‖ β ‖}_{TV},

(4)

where λ̌ is a different smoothing parameter. The TV optimization has been widely used to reconstruct images in the compressive sensing literature (see e.g., Candès et al. 2006a; Candès et al. 2006b; Needell and Ward 2013). Using the TV optimization for one-dimensional regression has been studied by Mammen and van de Geer (1997) and Tibshirani (2014). Michel et al. (2011) discussed some algorithms to solve a similar optimization problem. To the best of our knowledge, nothing has been done on the statistical properties of the TV estimator for scalar-on-image regression models.

To solve the TV minimization (3) (or (4)), we treat $β = (β_{j k}) \in ℝ^{N × N}$ as an N × N block of pixels with $β_{j k}$ as its (j, k) element. Then, we define the discrete total variation of $β = (β_{j k}) \in ℝ^{N × N}$. For any β ∈ BV(Ω), the discrete gradient $\nabla : BV(Ω) \to ℝ^{N × N × 2}$ is defined by

{(\nabla β)}_{j k} = \begin{cases} (β_{j + 1, k} - β_{j k}, β_{j, k + 1} - β_{j k}), & 1 \leq j, k \leq N - 1, \\ (0, β_{j, k + 1} - β_{j k}), & j = N, 1 \leq k \leq N - 1, \\ (β_{j + 1, k} - β_{j k}, 0), & 1 \leq j \leq N - 1, k = N, \\ (0, 0), & k = j = N . \end{cases}

Based on ${(\nabla β)}_{j k} = ({(\nabla β)}_{j k, 1}, {(\nabla β)}_{j k, 2})^{T}$, the anisotropic version of the total variation norm ||β||_TV can be rewritten as

{‖ β ‖}_{TV}^{aniso} = {‖ \nabla β ‖}_{1} = \sum_{j, k} \left\{∣ {(\nabla β)}_{j k, 1} ∣ + ∣ {(\nabla β)}_{j k, 2} ∣\right\} .

On the other hand, its isotropic version is defined by

{‖ β ‖}_{T V}^{iso} = \sum_{j k} {‖ {(\nabla β)}_{j k} ‖}_{2} = \sum_{j k} \sqrt{{(\nabla β)}_{j k, 1}^{2} + {(\nabla β)}_{j k, 2}^{2}} .

The anisotropic and isotropic induced total variation norms are equivalent up to a factor of $\sqrt{2}$ , i.e.,

\frac{1}{\sqrt{2}} {‖ β ‖}_{T V}^{iso} \leq {‖ β ‖}_{T V}^{aniso} \leq \sqrt{2} {‖ β ‖}_{T V}^{iso} .

We will write all results in terms of the anisotropic total variation seminorm, but our results also extend to the isotropic version.
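The discrete gradient and the two discrete total variation seminorms defined above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names (`discrete_gradient`, `tv_aniso`, `tv_iso`) and the test image (a 16 × 16 block of ones inside a 64 × 64 image) are hypothetical and are not taken from the paper.

```python
import numpy as np

def discrete_gradient(beta):
    """Forward differences, set to zero on the last row / last column.

    Returns an (N, N, 2) array: component 0 is beta[j+1, k] - beta[j, k],
    component 1 is beta[j, k+1] - beta[j, k].
    """
    beta = np.asarray(beta, dtype=float)
    grad = np.zeros(beta.shape + (2,))
    grad[:-1, :, 0] = beta[1:, :] - beta[:-1, :]
    grad[:, :-1, 1] = beta[:, 1:] - beta[:, :-1]
    return grad

def tv_aniso(beta):
    """Anisotropic TV: sum of absolute values of both gradient components."""
    return np.abs(discrete_gradient(beta)).sum()

def tv_iso(beta):
    """Isotropic TV: sum over pixels of the Euclidean norm of the gradient."""
    return np.sqrt((discrete_gradient(beta) ** 2).sum(axis=-1)).sum()

# Hypothetical piecewise-constant image: a block of ones on a zero background.
beta = np.zeros((64, 64))
beta[20:36, 24:40] = 1.0

aniso, iso = tv_aniso(beta), tv_iso(beta)
nonzero = np.count_nonzero(discrete_gradient(beta))
print(f"aniso TV = {aniso:.3f}, iso TV = {iso:.3f}, nonzero gradient entries = {nonzero}")
```

For this image only the pixels along the block boundary carry a nonzero gradient, which is the sparsity that the TV penalty exploits, and the two seminorms differ only at the single corner pixel where both components are nonzero.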

Let A_X be an n × N² design matrix such that the ith row is the vectorized X_i. With a slight abuse of notation, we use β to denote the coefficient matrix and its corresponding vector. We may rewrite (3) as the matrix form given by

\hat{β} = arg min {‖ β ‖}_{T V} subject to {‖ Y - A_{X} β ‖}_{2} \leq λ .

(5)

We adapt an algorithm called TVAL3 based on the augmented Lagrangian method (Hestenes 1969; Powell 1969; Li 2013). Specifically, we solve an equivalent optimization problem given by

min_{w, β} \sum_{l = 1}^{N^{2}} {‖ w_{l} ‖}_{1} subject to {‖ Y - A_{X} β ‖}_{2} \leq λ and D_{l} β = w_{l} for all l,

where D_l is a 2 × N² matrix of constants associated with the discrete gradient. As an illustration, we consider a simple case with N = 2. In this case, we have β = (β₁₁, β₁₂, β₂₁, β₂₂)^T. We may choose

D_{1} = \begin{bmatrix} - 1 & 1 & 0 & 0 \\ - 1 & 0 & 1 & 0 \end{bmatrix}, \quad D_{2} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & - 1 & 0 & 1 \end{bmatrix}, \quad D_{3} = \begin{bmatrix} 0 & 0 & - 1 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix}, \quad \text{and} \quad D_{4} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix},

so that, up to the ordering of the two components, we have D₁β = (∇β)₁₁, D₂β = (∇β)₁₂, D₃β = (∇β)₂₁, and D₄β = (∇β)₂₂. The ordering of the components does not affect the total variation.

Its corresponding augmented Lagrangian function is given by

ℒ_{A} (w, β) = \sum_{l = 1}^{N^{2}} \left\{{‖ w_{l} ‖}_{1} - v_{l}^{T} (D_{l} β - w_{l}) + \frac{α_{l}}{2} {‖ D_{l} β - w_{l} ‖}_{2}^{2} + \frac{γ}{2} {‖ A_{X} β - Y ‖}_{2}^{2}\right\},

where v_l, α_l, and γ are tuning parameters. We find the minimizer iteratively, and the subproblem at each iteration of TVAL3 is $\min_{w_{l}, β} ℒ_{A} (w_{l}, β)$. In our algorithm, v_l is updated at each iteration. Moreover, the α_l’s and γ, as smoothing parameters, can be selected by using either the C_p criterion or K-fold cross-validation (CV). However, the computational time of such a selection can be long even with current computing facilities. In our numerical examples, we fix the tuning parameters in advance by setting α_l = 2⁵ for l = 1, …, N² and γ = 2⁸. The simplest way to choose γ is to try different values from 2⁴ up to 2¹³ and compare the recovered images. The results are much less sensitive to the value of α_l than to the choice of γ. We leave tuning parameter optimization as a topic for future research.

We describe the complete algorithm as follows.

Step 1. Initialize $β^{(0)}$ and $v_{l}^{(0)}$;
Step 2. Given $β = β^{(k)}$ and $v_{l} = v_{l}^{(k)}$, we solve for $w_{l}^{(k + 1)}$, l = 1, …, N², by minimizing
${‖ w_{l} ‖}_{1} - v_{l}^{T} (D_{l} β - w_{l}) + \frac{α_{l}}{2} {‖ D_{l} β - w_{l} ‖}_{2}^{2} .$

The explicit solution (component-wise) is given by
$w_{l} = \begin{cases} D_{l} β - \frac{v_{l} + 1}{α_{l}}, & \text{if } D_{l} β > \frac{v_{l} + 1}{α_{l}}; \\ 0, & \text{if } \frac{v_{l} - 1}{α_{l}} \leq D_{l} β \leq \frac{v_{l} + 1}{α_{l}}; \\ D_{l} β - \frac{v_{l} - 1}{α_{l}}, & \text{if } D_{l} β < \frac{v_{l} - 1}{α_{l}} . \end{cases}$
Step 3. Given $w_{l} = w_{l}^{(k + 1)}$ and $v_{l} = v_{l}^{(k)}$, l = 1, …, N², we solve for $β^{(k + 1)}$ by minimizing
$\sum_{l = 1}^{N^{2}} \left\{- v_{l}^{T} D_{l} β + \frac{α_{l}}{2} {‖ D_{l} β - w_{l} ‖}_{2}^{2} + \frac{γ}{2} {‖ A_{X} β - Y ‖}_{2}^{2}\right\} .$

The explicit solution is given by
$β^{(k + 1)} = \left\{\sum_{l = 1}^{N^{2}} (α_{l} D_{l}^{T} D_{l} + γ A_{X}^{T} A_{X})\right\}^{- 1} \left\{\sum_{l = 1}^{N^{2}} (D_{l}^{T} v_{l} + α_{l} D_{l}^{T} w_{l} + γ A_{X}^{T} Y)\right\} .$
Step 4. Given $β = β^{(k + 1)}$ and $w_{l} = w_{l}^{(k + 1)}$, update $v_{l}^{(k + 1)}$ by using
$v_{l}^{(k + 1)} = v_{l}^{(k)} - α_{l} (D_{l} β^{(k + 1)} - w_{l}^{(k + 1)}) .$
Step 5. Iterate Steps 2–4 until convergence.
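The five steps above can be sketched for a small problem as follows. This is a minimal illustration, not the TVAL3 implementation: it assumes dense NumPy matrices, a common value α for all α_l, the anisotropic penalty, a fixed number of iterations instead of a convergence check, and a fidelity term that is counted once rather than inside the sum over l (which only rescales γ). The multiplier update is taken to be the usual augmented Lagrangian form v − α(Dβ − w). The names `gradient_operator`, `soft_threshold`, `tv_least_squares`, and `objective`, as well as the simulated data, are hypothetical.

```python
import numpy as np

def gradient_operator(N):
    """Stack all D_l into one (2 N^2, N^2) matrix for a row-major vectorized image."""
    F = -np.eye(N) + np.eye(N, k=1)   # 1-D forward difference
    F[-1, :] = 0.0                    # zero gradient on the last row/column
    I = np.eye(N)
    return np.vstack([np.kron(F, I), np.kron(I, F)])

def soft_threshold(x, t):
    """Component-wise minimizer of |w| + (1 / (2 t)) * (w - x)^2."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def tv_least_squares(A, Y, N, alpha=2.0**5, gamma=2.0**8, n_iter=300):
    """Alternate the w-, beta-, and multiplier updates (Steps 2 to 4)."""
    D = gradient_operator(N)
    M = alpha * D.T @ D + gamma * A.T @ A          # fixed system matrix of Step 3
    beta = np.zeros(N * N)
    v = np.zeros(2 * N * N)
    for _ in range(n_iter):
        w = soft_threshold(D @ beta - v / alpha, 1.0 / alpha)   # Step 2
        rhs = D.T @ v + alpha * D.T @ w + gamma * A.T @ Y
        beta = np.linalg.solve(M, rhs)                          # Step 3
        v = v - alpha * (D @ beta - w)                          # Step 4
    return beta

def objective(beta, A, Y, N, gamma=2.0**8):
    """Anisotropic TV plus (gamma / 2) times the residual sum of squares."""
    D = gradient_operator(N)
    return np.abs(D @ beta).sum() + 0.5 * gamma * np.sum((A @ beta - Y) ** 2)

# Hypothetical toy data: an 8 x 8 coefficient image containing a 4 x 4 block of ones.
rng = np.random.default_rng(0)
N, n = 8, 48
beta0 = np.zeros((N, N))
beta0[2:6, 2:6] = 1.0
A = rng.standard_normal((n, N * N))
Y = A @ beta0.ravel() + 0.1 * rng.standard_normal(n)

beta_hat = tv_least_squares(A, Y, N)
rel_err = np.linalg.norm(beta_hat - beta0.ravel()) / np.linalg.norm(beta0)
print(f"relative estimation error: {rel_err:.3f}")
```

Because the system matrix in Step 3 does not change across iterations, a production implementation would factor it once (or, as TVAL3 does for large images, avoid forming it at all); the dense solve is used here only to keep the sketch short.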

2.3 The error bound

In this subsection, we establish the nonasymptotic error bound for the TV estimator β̂ based on model (2). We consider two types of distances to measure the error. The first one is a weighted L₂ distance such that

{‖ \hat{β} - β_{0} ‖}_{X, 2} = {E^{*} ({〈 X_{n + 1}, \hat{β} - β_{0} 〉}^{2})}^{1 / 2},

where 𝔼^* represents taking expectation with respect to (Y_n₊₁, X_n₊₁) only. The second one is the TV distance between β̂ and β₀, ||β̂ − β₀||_TV.

We derive both error bounds by means of Haar wavelet basis. Various wavelet bases are commonly used to effectively represent images and the Haar wavelet is the simplest possible wavelet. The bivariate Haar wavelet basis for L₂(Ω) can be constructed as follows. Let ϕ⁰(t) = I_[0,1) be the indicator function, and the mother wavelet ϕ¹(t) = 1 for t ∈ [0, 1/2) and −1 for t ∈ [1/2, 1). Starting from the multivariate functions

ϕ^{d} (s, t) = ϕ^{d_{1}} (s) ϕ^{d_{2}} (t), \quad d = (d_{1}, d_{2}) \in \{(0, 1), (1, 0), (1, 1)\},

the bivariate Haar basis functions include the indicator function $I_{{[0, 1)}^{2}}$ and the functions

ϕ_{j, k}^{d} (u, v) = 2^{j} ϕ^{d} (2^{j} x - k), \quad d \in \{(0, 1), (1, 0), (1, 1)\}, \quad x = (u, v), \quad j \geq 0, \quad k \in ℤ^{2} \cap 2^{j} {[0, 1)}^{2} .

The bivariate Haar wavelet basis is an orthonormal basis for L₂[0, 1)². Note that discrete images are isometric to the space $ℐ_{N} \subset L_{2} {[0, 1)}^{2}$ of piecewise constant functions $ℐ_{N} = \{f \in L_{2} {[0, 1)}^{2} : f (s, t) = c_{j k}, \frac{j - 1}{N} \leq s < \frac{j}{N}, \frac{k - 1}{N} \leq t < \frac{k}{N}\}$ via the identification $c_{j k} = N X_{j k}$. Letting $N = 2^{J}$, the bivariate Haar basis restricted to the N² basis functions $\{I_{{[0, 1)}^{2}}, ϕ_{j, k}^{d} : j \leq J - 1, d \in \{(0, 1), (1, 0), (1, 1)\}, k \in ℤ^{2} \cap 2^{j} {[0, 1)}^{2}\}$ forms an orthonormal basis for $ℝ^{N × N}$. Denote by Φ the discrete bivariate Haar transformation and by {ϕ_l} the Haar basis, so that $Φβ \in ℝ^{N × N}$ contains the bivariate Haar wavelet coefficients of β. Next, we review a deep and nontrivial theoretical result on BV(Ω) due to Petrushev et al. (1999). Specifically, it states that the Haar wavelet coefficients of β₀ ∈ BV(Ω) are in weak ℓ₁. That is, if the Haar coefficients are sorted in decreasing order of their absolute values, then the l-th rearranged coefficient is bounded in absolute value by $c {‖ β_{0} ‖}_{BV} / l$, with c being an absolute constant.
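To make the discrete bivariate Haar transformation Φ concrete, the following sketch computes the Haar coefficients of an N × N image with N a power of two and lets one inspect how quickly the sorted coefficients decay. It is an illustration only: the function name `haar2d`, the layout of the coefficients in the output array, and the test image are assumptions of this sketch, not part of the paper.

```python
import numpy as np

def haar2d(img):
    """Orthonormal bivariate Haar transform of an N x N image, N a power of two.

    At each level, every 2 x 2 block of the current low-pass image is mapped to
    one average and three detail coefficients, each scaled by 1/2 so that the
    overall transform preserves the Frobenius norm.
    """
    out = np.array(img, dtype=float)
    N = out.shape[0]
    if out.shape != (N, N) or N < 1 or (N & (N - 1)) != 0:
        raise ValueError("image must be square with a power-of-two side length")
    size = N
    while size > 1:
        block = out[:size, :size]
        a, b = block[0::2, 0::2], block[0::2, 1::2]
        c, d = block[1::2, 0::2], block[1::2, 1::2]
        half = size // 2
        new = np.empty((size, size))
        new[:half, :half] = (a + b + c + d) / 2.0   # low-pass part, refined next
        new[:half, half:] = (a - b + c - d) / 2.0   # detail along the second index
        new[half:, :half] = (a + b - c - d) / 2.0   # detail along the first index
        new[half:, half:] = (a - b - c + d) / 2.0   # diagonal detail
        out[:size, :size] = new
        size = half
    return out

# Hypothetical piecewise-constant image with a block of ones.
beta = np.zeros((64, 64))
beta[20:36, 24:40] = 1.0
coef = haar2d(beta)

# Sorted absolute coefficients; weak-l1 decay means l * |gamma_(l)| stays bounded.
sorted_abs = np.sort(np.abs(coef).ravel())[::-1]
scaled = np.arange(1, sorted_abs.size + 1) * sorted_abs
print(f"max of l * |gamma_(l)|: {scaled.max():.3f}")
```

The printed quantity is the smallest constant for which the l-th sorted coefficient is bounded by that constant divided by l, so it can be compared with the total variation of the image.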

Invoking Haar wavelets is only for theoretical investigation and we do not estimate the Haar coefficients directly. We now introduce the main assumptions of this paper:

A1
Assume that the coefficient image β₀ lies in the space of N × N blocks of pixel values with bounded variation. Assume that the error ε is sub-Gaussian.
A2
Assume that the discrete Haar representation of the image predictor X is $X = \sum_{l} ρ_{l}^{1 / 2} ξ_{l} ϕ_{l}$ , where ρ_l are positive constants and ξ_l are independently and identically distributed sub-Gaussian random variables with zero mean and unit variance.
A3
For any β ∈ BV(Ω), write $β = \sum_{l} γ_{l} ϕ_{l}$, where the γ_l are the Haar basis coefficients of β. We arrange the γ_l in decreasing order according to their absolute values and denote the sorted coefficients by $γ_{(l)}$. Assume that the correspondingly sorted $ρ_{(l)}$, associated with the same basis functions, satisfy $c_{1} s^{- 2 q} \leq ρ_{(s)} \leq c_{2} s^{- 2 q}$ with q > 1/2 for each s and two positive constants c₁, c₂.

Assumption A2 on the wavelet representation of X is reasonable because the discrete wavelet transformation approximately decorrelates or “whitens” data (Vidakovic 1999). Although we might use the Karhunen-Loève expansion of X, we do not adopt this approach in order to avoid the additional complexity associated with the estimation of eigenfunctions. When we sort the Haar wavelet coefficients of both β and X, the corresponding basis functions may not follow the same order. Assumption A3 specifies the decay rate of the Haar wavelet coefficients of X. From A2, the predictor images X_i can be written as $X_{i} = \sum_{l} ρ_{l}^{1 / 2} ξ_{i l} ϕ_{l}$. Let Ã be an n × N² matrix with the (i, l)-th element being $ξ_{i l} / \sqrt{n}$. It is well known that Ã satisfies the restricted isometry property (RIP) with large probability (Candès et al. 2006a, 2006b). Specifically, if $n \geq C^{- 2} s \log (N^{2} / s)$, then with probability exceeding $1 - 2 e^{- C n}$, we have

(1 - δ) {‖ u ‖}_{2}^{2} \leq {‖ \tilde{A} u ‖}_{2}^{2} \leq (1 + δ) {‖ u ‖}_{2}^{2}

(6)

for all s-sparse vectors $u \in ℝ^{N^{2}}$ with a small RIP constant δ < C.

Let {γ̂_l} and {γ_l} be, respectively, the wavelet coefficients of β̂ and β₀. It turns out that ${‖ \hat{β} - β_{0} ‖}_{X, 2} = \{\sum_{l} ρ_{l} {({\hat{γ}}_{l} - γ_{l})}^{2}\}^{1 / 2}$, which is the weighted L₂-norm of the wavelet coefficient difference. On the other hand, since ||ϕ_l||_TV ≤ 8 (Needell and Ward, 2013),

{‖ \hat{β} - β_{0} ‖}_{T V} \leq \sum_{l} ∣ {\hat{γ}}_{l} - γ_{l} ∣ {‖ ϕ_{l} ‖}_{T V} \leq 8 {‖ {\hat{γ}}_{l} - γ_{l} ‖}_{1},

which is bounded by the L₁-norm of the wavelet coefficient difference. We obtain the following theorem, whose detailed proof can be found in the Appendix.

Theorem 2.1

Suppose that Assumptions A1–A3 hold. Let C be an absolute constant and $λ = C n^{1 / 2}$. If $n \geq C s^{2 q + 1} \log (N^{2} / s^{2 q + 1})$ and δ < 1/3 in (6), then with probability greater than 1 − 2 exp(−Cn), we have

{‖ \hat{β} - β_{0} ‖}_{X, 2} \leq C \left\{σ + \frac{1}{{(s \log N)}^{q + \frac{1}{2}}} {‖ \nabla β_{0} - {(\nabla β_{0})}_{s} ‖}_{1}\right\},

(7)

and

{‖ \hat{β} - β_{0} ‖}_{TV} \leq C \log \left(\frac{N^{2}}{s}\right) \left\{{(s \log N)}^{q + \frac{1}{2}} σ + {‖ \nabla β_{0} - {(\nabla β_{0})}_{s} ‖}_{1}\right\},

(8)

where ${(\nabla β_{0})}_{s} = \arg \min_{u : s\text{-sparse}} {‖ \nabla β_{0} - u ‖}_{1}$ is the best s-sparse approximation to the discrete gradient ∇β₀.

Theorem 2.1 provides nonasymptotic error bounds for ${‖ \hat{β} - β_{0} ‖}_{X, 2}$ and ${‖ \hat{β} - β_{0} ‖}_{TV}$, which are specified explicitly in terms of the sample size n, the image size N × N, and the underlying smoothness of the true coefficient image as measured by its discrete gradient.

Remark 2.1

We call a prediction “stable” if ${‖ \hat{β} - β_{0} ‖}_{X, 2} \leq C σ$ holds with high probability. Assume that the coefficient image has a sparse discrete gradient, i.e., ∇β₀ is supported on a set S₀ with |S₀| ≤ s. If $λ = C n^{1 / 2}$, then Theorem 2.1 shows that ${‖ \hat{β} - β_{0} ‖}_{X, 2} \leq C σ$, which indicates that our prediction procedure is stable. Furthermore, in the extreme case of noiseless data, our prediction procedure is exact. The required sample size n is of order $s^{2 q + 1} \log (N^{2} / s^{2 q + 1})$, which depends on the smoothness of the true coefficient image β₀, the relative smoothness between β₀ and X, and the image size N × N.

Remark 2.2

The parameter q characterizes the decay rate of the wavelet coefficients of X. The larger q is, the larger the required sample size. Theorem 2.1 also shows that the larger q is, the smaller the prediction error is. The case q = 0 corresponds to the special case discussed in Needell and Ward (2013).

3 Generalized scalar-on-image regression models

In this section, we extend all developments for model (2) to GSIRM-TV (1). Given $X \in ℝ^{N × N}$ and $Z \in ℝ^{p}$, the response variable Y is assumed to follow an exponential family distribution with density

\exp \left(\{Y η (X, Z; θ_{0}, β_{0}) - b (η (X, Z; θ_{0}, β_{0}))\} / a (ψ) + c (Y, ψ)\right),

(9)

where a(·), b(·), and c(·) are known functions, and ψ is either known or considered as a nuisance parameter. Our GSIRM-TV also assumes β₀ ∈ BV(Ω). It can be shown (Nelder and Wedderburn, 1972) that

E (Y ∣ X) = μ (X, Z; θ_{0}, β_{0}) = \dot{b} (η (X, Z; θ_{0}, β_{0})) and Var (Y ∣ X) = a (ψ) \ddot{b} (η (X, Z; θ_{0}, β_{0})),

where ḃ(η) and b̈(η) are, respectively, the first and second derivatives of b(η) with respect to η. Moreover, $η (X, Z; θ_{0}, β_{0}) = {\dot{b}}^{- 1} (g^{- 1} (θ_{0}^{T} Z + 〈 X, β_{0} 〉))$ is the canonical parameter of (9). A Gaussian distribution with variance σ² has a(ψ) = σ² and b(η) = η²/2, a Bernoulli distribution has a(ψ) = 1 and b(η) = log(1 + e^η), and a Poisson distribution has a(ψ) = 1 and b(η) = e^η.
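As a quick check of the mean and variance identities for the three families just listed, the sketch below codes b(η) together with its first two derivatives and compares the first derivative with a central finite difference. The dictionary `FAMILIES` and the function `mean_and_variance` are hypothetical names introduced only for this illustration.

```python
import numpy as np

# For each family: b(eta), its first derivative, and its second derivative.
FAMILIES = {
    "gaussian": (
        lambda e: 0.5 * e ** 2,
        lambda e: e,
        lambda e: np.ones_like(e),
    ),
    "bernoulli": (
        lambda e: np.log1p(np.exp(e)),
        lambda e: 1.0 / (1.0 + np.exp(-e)),
        lambda e: np.exp(e) / (1.0 + np.exp(e)) ** 2,
    ),
    "poisson": (np.exp, np.exp, np.exp),
}

def mean_and_variance(family, eta, a_psi=1.0):
    """Return E(Y) = b'(eta) and Var(Y) = a(psi) * b''(eta)."""
    _, b_dot, b_ddot = FAMILIES[family]
    eta = np.asarray(eta, dtype=float)
    return b_dot(eta), a_psi * b_ddot(eta)

# Numerical sanity check of b' on a small grid of canonical parameters.
eta_grid = np.linspace(-2.0, 2.0, 9)
h = 1e-5
for name, (b, b_dot, _) in FAMILIES.items():
    numeric = (b(eta_grid + h) - b(eta_grid - h)) / (2.0 * h)
    print(name, np.max(np.abs(numeric - b_dot(eta_grid))))
```

For example, at η = 0 the Bernoulli family gives mean 1/2 and variance 1/4, and the Poisson family gives mean and variance both equal to 1.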

3.1 Estimation

Let ξ = (θ, β) ∈ ℝ^p × BV(Ω). Given the observed data, we propose to find the estimate ξ̂ by minimizing the penalized negative log-likelihood function given by

- n^{- 1} \sum_{i = 1}^{n} \{Y_{i} η (X_{i}, Z_{i}; θ, β) - b (η (X_{i}, Z_{i}; θ, β))\} + λ {‖ β ‖}_{TV} .

(10)

We use an algorithm, which is a standard iteratively reweighted least squares for GLMs, modified to add a TV penalty, to calculate ξ̂ = (θ̂, β̂). Given a trial estimate of ξ, denoted by ξ̂_I, we introduce the iterative weights and the working dependent variable as

{\hat{ω}}_{i, I} = \ddot{b} ({\hat{η}}_{i, I}) \quad \text{and} \quad {\hat{Y}}_{i, I} = g ({\hat{μ}}_{i, I}) + (Y_{i} - {\hat{μ}}_{i, I}) \dot{g} ({\hat{μ}}_{i, I}),

(11)

where ${\hat{μ}}_{i, I} = μ (X_{i}, Z_{i}; {\hat{ξ}}_{I})$, $\dot{g} (μ) = d g (μ) / d μ$, and ${\hat{η}}_{i, I} = η (X_{i}, Z_{i}; {\hat{ξ}}_{I})$. Then, we can calculate the next estimate of ξ, denoted by ${\hat{ξ}}_{I + 1}$, as

{\hat{ξ}}_{I + 1} = \arg \min_{ξ} \left\{\sum_{i = 1}^{n} {\hat{ω}}_{i, I} {[{\hat{Y}}_{i, I} - \partial_{ξ} μ_{i, I} ({\hat{ξ}}_{I}) ξ]}^{2} + λ {‖ β ‖}_{TV}\right\},

(12)

where ∂_ξ = ∂/∂ξ. The optimization in (12) can be solved efficiently by using the TVAL3 algorithm discussed in Section 2. Finally, we iterate these updates until convergence.

We provide the complete algorithm as follows.

Step 1. Initialize $ξ^{(0)} = (θ^{(0)}, β^{(0)})$.
Step 2. For each k, define the weights and the working dependent variable in (11), and define the objective function in (12). Use the TVAL3 algorithm to solve for $ξ^{(k + 1)} = (θ^{(k + 1)}, β^{(k + 1)})$.
Step 3. Repeat Step 2 until convergence.

We consider the logistic scalar-on-image regression model as an example. Specifically, Y_i given (X_i, Z_i) follows a Bernoulli distribution with the success probability p_i and $logit (p_{i}) = 〈 X_{i}, β_{0} 〉 + θ_{0}^{T} Z_{i}$ for i = 1, …, n. Given the current estimate ξ̂_I, it is easy to obtain the iterative weight and effective response variable, respectively, given by

{\hat{ω}}_{i, I} = \frac{e^{{\hat{η}}_{i, I}}}{{(1 + e^{{\hat{η}}_{i, I}})}^{2}} and {\hat{Y}}_{i, I} = {\hat{η}}_{i, I} + \frac{Y_{i} - {\hat{μ}}_{i, I}}{{\hat{μ}}_{i, I} (1 - {\hat{μ}}_{i, I})} .

Therefore, the estimate ξ̂_I₊₁ can be obtained by solving a weighted penalized least squares in (12).
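The iterative weight and working response for the logistic model can be computed as in the following sketch. The function name `logistic_working_quantities` is hypothetical; the TV-penalized weighted least squares step that would consume these quantities is not shown.

```python
import numpy as np

def logistic_working_quantities(eta, y):
    """Return the IRLS weights and working responses for logistic regression.

    eta : current linear predictors <X_i, beta> + theta^T Z_i
    y   : binary responses (0 or 1)
    """
    eta = np.asarray(eta, dtype=float)
    y = np.asarray(y, dtype=float)
    mu = 1.0 / (1.0 + np.exp(-eta))   # fitted success probabilities
    weights = mu * (1.0 - mu)         # equals e^eta / (1 + e^eta)^2
    working = eta + (y - mu) / weights
    return weights, working

# Example: a single observation with eta = 0 and y = 1.
w, z = logistic_working_quantities(np.array([0.0]), np.array([1.0]))
print(w, z)
```

With η = 0 the fitted probability is 1/2, so the weight is 1/4 and the working response for y = 1 is 0 + (1 − 1/2)/(1/4) = 2. In practice the fitted probabilities may need to be kept away from 0 and 1 to avoid division by very small weights; that safeguard is omitted here.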

3.2 The error bound

We establish a nonasymptotic prediction error bound for GSIRM-TV. We need some additional assumptions as follows.

B1
Assume η(X, Z; β₀, θ₀) is bounded almost surely. Given (X, Z), the response Y is sub-Gaussian, i.e., 𝔼{exp(t[Y − ḃ(η(X, Z; β₀, θ₀))])|X} ≤ exp(t²σ̃²/2) for some σ̃² > 0 and all t ∈ ℝ.
B2
The function ḃ(·) is monotonic with inf_t b̈(t) ≥ c₃ and sup_t b̈ (t) ≤ c₄ for two positive constants c₃ and c₄.

The sub-Gaussian assumption B1 holds for many well-known distributions, such as Gaussian. The assumption B2 requires that the second derivative of b(·) is bounded above and away from zero.

Theorem 3.1

Suppose that Assumptions A1–A3 and B1–B2 hold. Let $λ = C n^{- 1 / 2}$, where C is a positive constant. If $n \geq C s^{2 q + 1} \log (N^{2} / s^{2 q + 1})$ and δ < 1/3 in (6), then with probability greater than 1 − 2 exp(−Cn), we have ${‖ \hat{θ} - θ_{0} ‖}_{2} \leq C n^{- 1 / 2}$,

{‖ \hat{β} - β_{0} ‖}_{X, 2} \leq C \left\{1 + \frac{1}{{(s \log N)}^{q + \frac{1}{2}}} {‖ \nabla β_{0} - {(\nabla β_{0})}_{s} ‖}_{1}\right\},

(13)

and

{‖ \hat{β} - β_{0} ‖}_{TV} \leq C \log \left(\frac{N^{2}}{s}\right) \left\{{(s \log N)}^{q + \frac{1}{2}} σ + {‖ \nabla β_{0} - {(\nabla β_{0})}_{s} ‖}_{1}\right\} .

(14)

The conditional mean of $Y_{n + 1}$ given $X_{n + 1}$ is $\dot{b} (η (X_{n + 1}, β_{0}))$. We may measure the accuracy of β̂ by $E^{*} {[\dot{b} (η (X_{n + 1}, \hat{β})) - \dot{b} (η (X_{n + 1}, β_{0}))]}^{2}$. Under B1, this risk is bounded by ${‖ \hat{β} - β_{0} ‖}_{X, 2}^{2}$, and thus it is reasonable to study the nonasymptotic behavior of ${‖ \hat{β} - β_{0} ‖}_{X, 2}$. Theorem 2.1 is a special case of Theorem 3.1 if the responses in Theorem 2.1 are assumed to follow a normal distribution. If the coefficient image has a sparse discrete gradient, Theorem 3.1 shows that ${‖ \hat{β} - β_{0} ‖}_{X, 2}$ is bounded by a constant, which is proportional to σ under the assumptions of Theorem 2.1. This shows that our prediction procedure is stable for GSIRM-TV.

4 Simulation Studies

In this section, we conduct a set of Monte Carlo simulations to examine the finite-sample performance of the TV estimate β̂ and compare it with five competing methods. The first approach (Lasso) is to calculate the Lasso estimates of β₀. The second one (Lasso-Haar) is to calculate the Lasso estimates of the Haar coefficients of β₀ and use the inverse discrete wavelet transform to calculate the estimates of β₀. The third one (Matrix-Reg) is to estimate β₀ by using a recent development called regularized matrix regression (Zhou and Li 2014), which treats the coefficient image as a matrix and penalizes the nuclear norm of this matrix. The fourth one (FPCR) is the functional principal component regression approach (Reiss and Ogden 2007, 2010), which uses tensor product cubic B-splines to approximate the coefficient function. The fifth one (WNET) is to perform scalar-on-image regression in the wavelet domain by the naive elastic net (Zhao et al. 2014). Among these six approaches, the TV, Lasso, Lasso-Haar, and Matrix-Reg methods have been implemented in Matlab, and the FPCR and WNET methods have been implemented in the R packages ‘refund’ and ‘refund.wave’ (see Reiss et al. 2015), respectively. For the FPCR and WNET methods, we have used the default settings of both packages. The wavelet basis used in WNET is the Daubechies basis.

We present some results based on the linear scalar-on-image regression model (2). Specifically, the X_i were simulated from a 64×64 phantom map with N = 64 and 4,096 pixels according to a spatially correlated random process X_i = Σ_l l^{−q/2} ξ_{il} ϕ_l with q = 0, 0.5, and 1, where the ξ_{il} are standard normal random variables and the ϕ_l are bivariate Haar wavelet basis functions. We consider four different β₀ images including triangle, oval, T-shape, and checkerboard shapes (Figure 3). Among them, the triangle and oval images are convex, while the other two are not. Errors ε_i were independently generated from N(0, 1). We set n₁ = 300 for the training set and n₂ = 100 for the test set. We repeated each setting 100 times. We calculated the root mean squared prediction error (RMSPE) to compare the finite sample performance of the six different estimation methods. Let β̂ be the estimated coefficient image from the training set and Ŷ_i = 〈β̂, X_i〉 be the predicted responses for the test set. For each test set, RMSPE is defined by

RMSPE = \sqrt{n_{2}^{- 1} \sum_{i = 1}^{n_{2}} {({\hat{Y}}_{i} - Y_{i})}^{2}} .

Figure 3. The true coefficient images used for the simulation study.

We also calculated the means and standard errors of RMSPEs for the 100 testing datasets.
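The data-generating process and the RMSPE above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Matlab code: it uses a tensor-product orthonormal Haar basis, indexes the basis functions l = 1, …, N² in row-major order, and the function names (`haar_matrix`, `simulate_predictors`, `simulate_responses`, `rmspe`) are hypothetical.

```python
import numpy as np


def haar_matrix(n):
    """Orthonormal 1-D Haar matrix (rows are basis vectors); n must be a power of 2."""
    if n == 1:
        return np.ones((1, 1))
    h = haar_matrix(n // 2)
    top = np.kron(h, [1.0, 1.0])                   # coarser-scale functions
    bottom = np.kron(np.eye(n // 2), [1.0, -1.0])  # finest-scale wavelets
    return np.vstack([top, bottom]) / np.sqrt(2.0)


def simulate_predictors(n, N, q, rng):
    """Draw n images X_i = sum_l l^(-q/2) xi_il phi_l on an N x N grid."""
    H = haar_matrix(N)
    # Assumption: basis functions are indexed l = 1, ..., N^2 in row-major order.
    weights = (np.arange(1, N * N + 1) ** (-q / 2.0)).reshape(N, N)
    coeffs = rng.standard_normal((n, N, N)) * weights
    # Inverse tensor-product Haar transform, applied image by image.
    return H.T @ coeffs @ H


def simulate_responses(X, beta0, rng, sigma=1.0):
    """Y_i = <beta0, X_i> + eps_i with eps_i ~ N(0, sigma^2)."""
    signal = np.tensordot(X, beta0, axes=([1, 2], [0, 1]))
    return signal + sigma * rng.standard_normal(X.shape[0])


def rmspe(y_hat, y):
    """Root mean squared prediction error over a test set."""
    diff = np.asarray(y_hat, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))
```

For a setting resembling the one described, one would call `simulate_predictors(300, 64, q, rng)` for the training set and `simulate_predictors(100, 64, q, rng)` for the test set, and then average `rmspe` over the 100 replications.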

Figures 4 and 5 present the estimated β₀ from a randomly selected training dataset with q = 0 and q = 0.5, respectively, for the sample size n = 300. For all four different shapes, our TV estimates can capture the sharp boundaries of the underlying shapes. In contrast, the Lasso method fails for all shapes, since the predictor images X_i are highly correlated. The Lasso estimates of the Haar coefficients can roughly capture the true shapes. However, this method cannot faithfully recover the sharp boundaries of the triangle, oval, and T shapes, whereas it works very well for the checkerboard shape, since the checkerboard shape is exactly one of the bivariate Haar wavelet basis functions. The matrix regression approach can roughly capture the true shapes when q = 0, but it fails when q = 0.5, for which the entries of X are spatially correlated. The FPCR approach uses splines to approximate the predictor images, and it cannot preserve the sharp edges of the coefficient estimator for our examples. The WNET method fails when q = 0, but it can capture the shapes of the true coefficient image when the predictors are more spatially correlated.

Figure 4. The estimated coefficient images from six estimation methods when q = 0 and n = 300: TV (top row); Lasso (second row); Lasso-Haar (third row); Matrix regression (fourth row); FPCR (fifth row); and WNET (sixth row).

Figure 5. The estimated coefficient images from six methods when q = 0.5 and n = 300: TV (top row); Lasso (second row); Lasso-Haar (third row); Matrix regression (fourth row); FPCR (fifth row); and WNET (sixth row).

Table 1 presents the RMSPEs of all six methods across all shapes. Overall, our TV method has significantly smaller prediction errors, in particular for q = 0. As expected, the Lasso method leads to the largest prediction error. For all these methods, the larger q is, the smaller the RMSPE. For larger q, which means that the predictor images are more spatially correlated, the TV, Lasso-Haar, FPCR, and WNET methods perform similarly.

Table 1.

The RMSPEs of six methods including TV, Lasso, Lasso-Haar, Matrix-Reg, FPCR, and WNET for four different shapes: the numbers in brackets are the corresponding standard errors of those RMSPEs.

	TV			Lasso			Lasso-Haar
q	0	0.5	1	0	0.5	1	0	0.5	1
Triangle	3.20 (0.70)	1.53 (0.12)	1.18 (0.08)	34.02 (2.25)	13.12 (0.95)	3.96 (0.32)	18.76 (1.43)	5.08 (0.52)	1.98 (0.17)
Oval	1.69 (0.22)	1.49 (0.13)	1.22 (0.09)	31.16 (2.00)	11.83 (0.95)	3.57 (0.25)	14.99 (1.24)	3.82 (0.28)	1.67 (0.15)
T-shape	2.04 (0.51)	1.47 (0.09)	1.23 (0.10)	30.10 (2.41)	18.81 (0.81)	3.34 (0.28)	19.65 (1.84)	4.62 (0.46)	1.69 (0.16)
Checkerboard	6.27 (1.09)	1.45 (0.11)	1.10 (0.08)	49.00 (3.54)	25.46 (1.65)	9.08 (0.83)	1.43 (0.14)	1.06 (0.08)	1.05 (0.07)

	Matrix-Reg			FPCR			WNET
q	0	0.5	1	0	0.5	1	0	0.5	1

Triangle	23.13 (1.78)	6.46 (0.55)	4.00 (0.34)	10.89 (0.76)	2.27 (0.18)	1.19 (0.09)	28.42 (2.45)	2.10 (0.17)	1.28 (0.10)
Oval	18.81 (1.58)	5.58 (0.37)	3.41 (0.30)	9.28 (0.68)	2.12 (0.15)	1.19 (0.09)	24.74 (2.40)	2.03 (0.16)	1.26 (0.10)
T-shape	20.97 (1.45)	5.49 (0.38)	3.57 (0.33)	11.69 (0.82)	3.42 (0.26)	1.64 (0.12)	24.41 (2.18)	2.52 (0.21)	1.46 (0.13)
Checkerboard	35.33 (2.89)	12.80 (0.95)	6.35 (0.38)	10.70 (0.83)	3.05 (0.22)	1.14 (0.10)	44.24 (3.23)	5.41 (0.55)	2.03 (0.15)


5 Real data analysis

To illustrate the usefulness of our proposed model, we consider anatomical MRI data collected at the baseline by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) study, which is a large scale multi-site study collecting clinical, imaging, and laboratory data at multiple time points from healthy controls, individuals with amnestic mild cognitive impairment, and subjects with Alzheimer’s disease (AD). “Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private pharmaceutical companies and non-profit organizations, as a $60 million, 5-year public-private partnership. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD). Determination of sensitive and specific markers of very early AD progression is intended to aid researchers and clinicians to develop new treatments and monitor their effectiveness, as well as lessen the time and cost of clinical trials. The Principal Investigator of this initiative is Michael W. Weiner, MD, VA Medical Center and University of California, San Francisco. ADNI is the result of efforts of many coinvestigators from a broad range of academic institutions and private corporations, and subjects have been recruited from over 50 sites across the U.S. and Canada. The initial goal of ADNI was to recruit 800 subjects but ADNI has been followed by ADNI-GO and ADNI-2. To date these three protocols have recruited over 1500 adults, ages 55 to 90, to participate in the research, consisting of cognitively normal older individuals, people with early or late MCI, and people with early AD. The follow up duration of each group is specified in the protocols for ADNI-1, ADNI-2 and ADNI-GO. Subjects originally recruited for ADNI-1 and ADNI-GO had the option to be followed in ADNI-2. For up-to-date information, see www.adni-info.org.”

Alzheimer’s disease as an age-related neurodegenerative brain disorder is often characterized by progressive loss in memory and deterioration of cognitive functions (De La Torre 2010; Weiner et al. 2012). Important neuropathological hallmarks of AD are the gradual intraneuronal accumulation of neurofibrillary tangles formed as a result of abnormal hyperphosphorylation of cytoskeletal tau protein, extracellular deposition of amyloid-β (Aβ) protein as senile plaques, and massive neuronal death. These pathologies are evident in the hippocampus, which is located in the medial temporal lobe underneath the cortical surface, and other vulnerable brain areas. The hippocampus belongs to the limbic system and plays important roles in the consolidation of information from short-term memory to long-term memory and spatial navigation (Colom et al. 2013; Fennema-Notestine et al. 2009; Luders et al. 2013).

Given the MRI scans, hippocampal substructures were segmented with FSL FIRST (Patenaude et al. 2011) and hippocampal surfaces were automatically reconstructed with the marching cube method (Lorensen and Cline 1987). We adopted a surface fluid registration based hippocampal subregional analysis package (Shi et al. 2013), which uses isothermal coordinates and fluid registration to generate one-to-one hippocampal surface registration for surface statistics computation. It introduces two cuts on a hippocampal surface to convert it into a genus zero surface with two open boundaries. The two cuts are located at the front and back of the hippocampal surface. By using conformal parameterization, it essentially converts a 3D surface registration problem into a two-dimensional (2D) image registration problem. The flow induced in the parameter domain establishes high-order correspondences between 3D surfaces. Finally, various surface statistics were computed on the registered surface, such as multivariate tensor-based morphometry (mTBM) statistics (Wang et al. 2010), which retain the full tensor information of the deformation Jacobian matrix, together with the radial distance (Pizer et al. 1999). This software package and associated image processing methods have been adopted and described by various studies (Shi et al. 2014).

We applied GSIRM-TV to the hippocampus data set calculated from ADNI. The sample in our investigation includes n = 403 subjects: 223 healthy controls (HC) (107 females and 116 males) and 180 individuals with AD (87 females and 93 males). We take binary disease status as the response, with 0 denoting HC and 1 denoting AD. The image predictor X_i is the 2D representation of the left hippocampus. The covariate vector Z_i includes a constant (= 1), gender (Female = 0 and Male = 1), age (55–92), and behavior score (1–36). Given (X_i, Z_i), Y_i is assumed to follow a Bernoulli distribution with the success probability p_i satisfying

logit (p_{i}) = 〈 X_{i}, β_{0} 〉 + θ_{0}^{T} Z_{i} for i = 1, \dots, n .

We used the iterative reweighted algorithm described above to estimate the unknown parameters.
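As an illustration of the logit model above, the following Python sketch evaluates the success probabilities p_i for given parameter values and converts them into a classification error. It does not reproduce the iterative reweighted estimation algorithm; the function names and the 0.5 classification threshold are assumptions made only for this example.

```python
import numpy as np


def success_probability(X, Z, beta, theta):
    """p_i = 1 / (1 + exp(-eta_i)) with eta_i = <X_i, beta> + theta^T Z_i.

    X has shape (n, a, b), beta has shape (a, b), Z has shape (n, p),
    and theta has shape (p,).
    """
    eta = np.tensordot(X, beta, axes=([1, 2], [0, 1])) + Z @ theta
    return 1.0 / (1.0 + np.exp(-eta))


def classification_error(y_true, p, threshold=0.5):
    """Fraction of subjects whose predicted label (p >= threshold) is wrong."""
    y_pred = (np.asarray(p) >= threshold).astype(int)
    return float(np.mean(y_pred != np.asarray(y_true)))
```

Here the first column of `Z` would hold the constant 1, so that the first entry of `theta` plays the role of the intercept.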

Table 2 presents the estimates of θ₀ and their corresponding standard deviations, which were calculated by using the bootstrap method. Figure 7 shows the estimated coefficient images obtained from the six estimation methods. The effects around pixels (5, 40), (40, 40), and (95, 40) seem to be captured well by our TV estimate. The confidence band for the coefficient image can also be obtained by using the bootstrap method. We randomly partitioned the hippocampus data set into a training set with n₁ = 203 and a test set with n₂ = 200. We repeated this random partition 100 times and computed 100 classification errors. The average classification error of TV is 8.13% with a standard error of 1.56%. We also obtained the average classification errors for the other five methods: 12.23% (7.36%), 21.65% (15.56%), 12.03% (11.55%), 17.13% (3.27%), and 16.45% (15.57%) for Lasso, Lasso-Haar, matrix regression, FPCR, and WNET, respectively, with standard errors in parentheses. For the WNET method, since the R code requires the image size to be a power of 2, we padded the images with zeros to a size of 256 × 256, as suggested by one of the referees. Inspecting Table 2 reveals that sex and age are not significant in GSIRM-TV. We reran the same procedure without sex and age and obtained a classification result similar to that of the full model; the details are omitted.
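The repeated random-partition evaluation described above can be sketched generically as follows. This is a hedged illustration: `fit` and `predict` stand for any estimation and prediction routines supplied by the user (hypothetical callables, not part of the paper), and summarizing the replications by their mean and sample standard deviation is an assumption about how the reported spread was computed.

```python
import numpy as np


def repeated_split_errors(fit, predict, features, labels, n_train, n_rep=100, seed=0):
    """Classification errors over n_rep random train/test partitions.

    fit(features_train, labels_train) returns a fitted model;
    predict(model, features_test) returns predicted 0/1 labels.
    """
    rng = np.random.default_rng(seed)
    n = len(labels)
    errors = np.empty(n_rep)
    for r in range(n_rep):
        perm = rng.permutation(n)                  # fresh random partition
        train, test = perm[:n_train], perm[n_train:]
        model = fit(features[train], labels[train])
        errors[r] = np.mean(predict(model, features[test]) != labels[test])
    return errors


def summarize(errors):
    """Mean and sample standard deviation of the classification errors."""
    return float(np.mean(errors)), float(np.std(errors, ddof=1))
```

With n = 403 subjects, `n_train=203` leaves 200 subjects in each test set, matching the partition sizes used in the text.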

Table 2.

ADNI hippocampus data set: the estimated coefficients of the four scalar covariates and their standard deviations in parentheses.

	intercept	sex	age	behavior score
θ̂	−1.807 (3.186)	−0.533 (0.590)	−0.093 (0.043)	0.869 (0.111)


Figure 7. Estimated coefficient images for the hippocampus data based on six methods: the 2D representation of the TV estimator (a) and the surface representations of the TV estimator (b), Lasso estimator (c), Lasso-wavelet estimator (d), matrix regression estimator (e), FPCR estimator (f), and WNET estimator (g).

6 Conclusion

We have developed a class of GSIRM-TVs for scalar response and imaging and/or scalar predictors, while explicitly assuming that its slope function belongs to BV(Ω). We have developed an efficient penalized total variation minimization to estimate the coefficient image. We have used simulations and real data analysis to show that GSIRM-TV is quite efficient for estimating the slope function, while preserving its edges and jumps. We have established the nonasymptotic error bound of the TV estimate for the excess risk.

It is known that many image data have small total variation and are compressible with respect to wavelet transform. Therefore, we may generalize our approach to include both total variation penalty and Lasso penalty on the wavelet coefficients. Specifically, let Φ be the wavelet transformation operator and γ be the wavelet coefficients of the coefficient image β₀. We may calculate γ by minimizing

\sum_{i = 1}^{n} {(Y_{i} - 〈 X_{i}, Φ^{- 1} γ 〉)}^{2} + λ_{1} {‖ Φ^{- 1} γ ‖}_{T V} + λ_{2} {‖ γ ‖}_{1},

(15)

where Φ⁻¹ is the inverse discrete wavelet transform, and β = Φ⁻¹γ. In (15), there are two smoothing parameters, λ₁ and λ₂, which need to be selected. An efficient algorithm also needs to be developed to solve (15). We leave this for future research.
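To make the objective in (15) concrete, the sketch below evaluates it for a given coefficient array γ. It only computes the objective value and is not a solver. The assumptions are labeled in the code: Φ is taken to be a tensor-product orthonormal Haar transform, the TV seminorm is the anisotropic one, and all function names are hypothetical.

```python
import numpy as np


def haar_matrix(n):
    """Orthonormal 1-D Haar matrix (rows are basis vectors); n must be a power of 2."""
    if n == 1:
        return np.ones((1, 1))
    h = haar_matrix(n // 2)
    top = np.kron(h, [1.0, 1.0])
    bottom = np.kron(np.eye(n // 2), [1.0, -1.0])
    return np.vstack([top, bottom]) / np.sqrt(2.0)


def tv_anisotropic_2d(beta):
    """Anisotropic TV: sum of absolute forward differences along both axes."""
    return np.abs(np.diff(beta, axis=0)).sum() + np.abs(np.diff(beta, axis=1)).sum()


def tv_wavelet_objective(gamma, X, y, lam1, lam2):
    """Value of (15): squared error + lam1 * TV(Phi^{-1} gamma) + lam2 * ||gamma||_1.

    Assumption: Phi is the tensor-product orthonormal Haar transform, so
    Phi^{-1} gamma = H^T gamma H for the N x N coefficient array gamma.
    """
    H = haar_matrix(gamma.shape[0])
    beta = H.T @ gamma @ H                                   # beta = Phi^{-1} gamma
    resid = y - np.tensordot(X, beta, axes=([1, 2], [0, 1])) # Y_i - <X_i, beta>
    return float(resid @ resid + lam1 * tv_anisotropic_2d(beta)
                 + lam2 * np.abs(gamma).sum())
```

Because both penalties are convex in γ, this objective could in principle be passed to a generic convex solver, although the text leaves the development of an efficient algorithm open.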

We have so far focused on two-dimensional (2-D) images. It would be interesting to extend our method to analyze k-dimensional (k-D) images for k ≥ 2 (Zhou et al., 2013; Zhu et al., 2014). For example, consider a 3-D image f ∈ ℝ^{N³}, where f = (f_e), in which e = (e₁, e₂, e₃) ∈ {1, …, N}³. The inner product can be defined as

〈 f, g 〉 = \sum_{e \in {\{1, \dots, N\}}^{3}} f_{e} \cdot g_{e} .

For ℓ = 1, 2, and 3, the discrete derivative of f in the direction of r_ℓ is f_{r_ℓ} ∈ ℝ^{N^{ℓ−1} × N × N^{3−ℓ}}, with

{(f_{r_{1}})}_{e} = f_{(e_{1} + 1, e_{2}, e_{3})} - f_{(e_{1}, e_{2}, e_{3})}, {(f_{r_{2}})}_{e} = f_{(e_{1}, e_{2} + 1, e_{3})} - f_{(e_{1}, e_{2}, e_{3})}, {(f_{r_{3}})}_{e} = f_{(e_{1}, e_{2}, e_{3} + 1)} - f_{(e_{1}, e_{2}, e_{3})},

and the 3-D discrete gradient is (∇f)_e = ((f_{r_1})_e, (f_{r_2})_e, (f_{r_3})_e), where (f_{r_ℓ})_e is given by the differences above for e_ℓ ≤ N − 1 and is zero elsewhere. Hence the 3-D anisotropic and isotropic total variation seminorms can be defined similarly. We may consider a total variation optimization similar to (4) to estimate the 3-D coefficient image. This research is currently under investigation and will be presented in another report.
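The 3-D discrete gradient and the two total variation seminorms described above can be sketched with NumPy as follows. The function names are hypothetical, and the isotropic seminorm is taken to be the sum over voxels of the Euclidean norm of the gradient, by analogy with the usual 2-D definition.

```python
import numpy as np


def discrete_gradient_3d(f):
    """Forward differences along each axis, set to zero where e_l = N.

    Returns an array of shape (3, N, N, N) whose l-th slice is f_{r_l}.
    """
    grad = np.zeros((3,) + f.shape)
    grad[0, :-1, :, :] = f[1:, :, :] - f[:-1, :, :]
    grad[1, :, :-1, :] = f[:, 1:, :] - f[:, :-1, :]
    grad[2, :, :, :-1] = f[:, :, 1:] - f[:, :, :-1]
    return grad


def tv_anisotropic_3d(f):
    """Sum over voxels of the l1 norm of the discrete gradient."""
    return float(np.abs(discrete_gradient_3d(f)).sum())


def tv_isotropic_3d(f):
    """Sum over voxels of the l2 norm of the discrete gradient."""
    return float(np.sqrt((discrete_gradient_3d(f) ** 2).sum(axis=0)).sum())
```

For a piecewise constant image, both seminorms grow with the area of the jump surface rather than with the number of voxels, which is why a TV penalty preserves sharp boundaries.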

Acknowledgments

The authors would like to thank Dr. Yalin Wang for sharing the processed data. We would also like to thank the Editor, Associate Editor, and referees for constructive comments. The research of Xiao Wang is supported by NSF grants DMS1042967 and CMMI1030246. The research of Hongtu Zhu is partially supported by NIH grants MH086633 and 1UL1TR001111, NSF grants SES-1357666 and DMS-1407655, and a grant from Cancer Prevention Research Institute of Texas. This material was based upon work partially supported by the NSF grant DMS-1127914 to the Statistical and Applied Mathematical Sciences Institute.

7 Appendix

In this appendix, we provide proofs of Theorems 2.1 and 3.1. The constant C represents a universal constant independent of everything else, but its value may differ from line to line.

7.1 Proof of Theorem 2.1

We prove the theorem by extending the arguments from Candès, Romberg, and Tao (2006a, 2006b) and Needell and Ward (2013). In the following, a ≲ b means that there exists a constant C such that a ≤ Cb.

Recall that {ϕ_l} is a set of discrete bivariate Haar wavelet basis functions. Write $X_{i} = \sum_{l} ξ_{i l} ρ_{l}^{1 / 2} ϕ_{l}$, β₀ = Σ_l γ_lϕ_l, and β̂ = Σ_l γ̂_lϕ_l. We aim to derive the error bounds of Σ_l ρ_l(γ_l − γ̂_l)² and Σ_l |γ_l − γ̂_l|. Denote α = β₀ − β̂ and write α = Σ_l h_lϕ_l, where the h_l = γ_l − γ̂_l are the wavelet coefficients of the difference between the true coefficient image and the estimated coefficient image. We sort the h_l in descending order of absolute value and denote the sorted coefficients by h_{(l)}. The ρ_l attached to the same basis function as h_{(l)} is denoted by ρ_{(l)}. Note that the ρ_{(l)} are not necessarily sorted, but they are assumed to satisfy Condition A3.

Let S denote the support of the s largest entries, in absolute value, of α. As shown in Lemma 9 of Needell and Ward (2013), the set K of wavelets which are non-constant over S has cardinality at most 8s log N. With an abuse of notation, let K = 8s log N. Lemma 7.1 derives cone constraints for the wavelet coefficients h_{(j)} and the weighted wavelet coefficients $ρ_{(j)}^{1 / 2} h_{(j)}$.

In the following, we focus on $ρ_{(l)}^{1 / 2} h_{(l)}$ for l = K + 1, …, N². Let

\tilde{s} = c s^{2 q + 1} {(log N)}^{2 q + 1}, d = ⌊ N^{2} / (4 \tilde{s}) ⌋ .

We may write K^c = K₁ ∪ K₂ ∪ ⋯ ∪ K_d, where K₁ consists of the 4s̃ largest |h_{(l)}| within K^c, K₂ consists of the next 4s̃ largest |h_{(l)}|, and so on. Since ρ_{(l)} is of order l^{−2q} and the magnitude of each $ρ_{l}^{1 / 2} ∣ h_{l} ∣$ in K_{j−1} is larger than that in K_j up to a constant, we have

{(\sum_{l \in K_{j}} ρ_{l} {∣ h_{l} ∣}^{2})}^{1 / 2} ≲ \frac{1}{2 \sqrt{\tilde{s}}} \sum_{l \in K_{j - 1}} ρ_{l}^{1 / 2} ∣ h_{l} ∣ for j = 2, 3, ....

Combining this result with Lemma 7.1 yields

\begin{array}{l} \sum_{j = 2}^{d} {(\sum_{l \in K_{j}} ρ_{l} {∣ h_{l} ∣}^{2})}^{1 / 2} ≲ \frac{1}{2 \sqrt{\tilde{s}}} \sum_{l = K + 1}^{N^{2}} ρ_{(l)}^{1 / 2} ∣ h_{(l)} ∣ \\ ≲ \frac{1}{2 \sqrt{\tilde{s}}} \left\{ s^{q} {(log N)}^{q} \sum_{l = 1}^{K} ρ_{(l)}^{1 / 2} ∣ h_{(l)} ∣ + {‖ \nabla β_{0} - {(\nabla β_{0})}_{S} ‖}_{1} \right\} \\ ≲ \frac{1}{2 \sqrt{K}} \sum_{l = 1}^{K} ρ_{(l)}^{1 / 2} ∣ h_{(l)} ∣ + \frac{1}{2 \sqrt{\tilde{s}}} {‖ \nabla β_{0} - {(\nabla β_{0})}_{S} ‖}_{1} \\ \leq \frac{1}{2} {(\sum_{l = 1}^{K} ρ_{(l)} {∣ h_{(l)} ∣}^{2})}^{1 / 2} + \frac{1}{2 \sqrt{\tilde{s}}} {‖ \nabla β_{0} - {(\nabla β_{0})}_{S} ‖}_{1} . \end{array}

(16)
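The passage from the second to the third line of (16), and from the third to the fourth, can be spelled out as follows. This is a reader's sketch that uses only the definitions s̃ = c s^{2q+1}(log N)^{2q+1} and K = 8s log N given above. It assumes that the constants hidden in ≲ are absorbed by a suitable choice of c; the shorthand a_l is introduced here only for this remark.

```latex
% Step 1: the factor in front of the first sum is of order K^{-1/2}.
\frac{s^{q} (\log N)^{q}}{\sqrt{\tilde{s}}}
  = \frac{(s \log N)^{q}}{\sqrt{c}\, (s \log N)^{q + 1/2}}
  = \frac{1}{\sqrt{c\, s \log N}}
  = \sqrt{\frac{8}{c}} \cdot \frac{1}{\sqrt{K}} .

% Step 2: Cauchy--Schwarz applied to the K terms
% a_l = \rho_{(l)}^{1/2} |h_{(l)}|, l = 1, ..., K.
\frac{1}{\sqrt{K}} \sum_{l = 1}^{K} a_{l}
  \leq \Bigl( \sum_{l = 1}^{K} a_{l}^{2} \Bigr)^{1/2} .
```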

Recall that β̂ is calculated by solving (5). Let Ã be an n × N² matrix with the (i, l)th element being n^{−1/2}ξ_{il}, ρ be a diagonal matrix with the l-th diagonal element ρ_l, and γ and h be the wavelet coefficients of β₀ and α, respectively. Therefore, $A_{X} β_{0} = \sqrt{n} \tilde{A} ρ^{1 / 2} γ$ and $A_{X} α = \sqrt{n} \tilde{A} ρ^{1 / 2} h$. Let $λ = C \sqrt{n} σ$. With probability more than 1 − e^{−Cn}, ${‖ Y - A_{X} β_{0} ‖}_{2} \leq C \sqrt{n} σ$. This gives

\sqrt{n} {‖ \tilde{A} ρ^{1 / 2} h ‖}_{2} = {‖ A_{X} α ‖}_{2} = {‖ A_{X} β_{0} - A_{X} \hat{β} ‖}_{2} \leq {‖ Y - A_{X} β_{0} ‖}_{2} + {‖ Y - A_{X} \hat{β} ‖}_{2} ≲ \sqrt{n} σ .

(17)

Following the argument in Candès et al. (2006a, 2006b), if n ≥ C⁻²s log(N²/s), then Ã satisfies the restricted isometry property (RIP) with a large probability: with probability exceeding 1 − 2e^{−Cδ²n},

(1 - δ) {‖ u ‖}_{2}^{2} \leq {‖ \tilde{A} u ‖}_{2}^{2} \leq (1 + δ) {‖ u ‖}_{2}^{2},

(18)

for all s-sparse vectors u ∈ ℝ^{N²}. Therefore,

\begin{array}{l} \sqrt{n} σ ≳ \sqrt{n} {‖ \tilde{A} ρ^{1 / 2} h ‖}_{2} \geq \sqrt{n} {‖ \tilde{A} {(ρ^{1 / 2} h)}_{K} + \tilde{A} {(ρ^{1 / 2} h)}_{K_{1}} ‖}_{2} - \sqrt{n} \sum_{j = 2}^{d} {‖ \tilde{A} {(ρ^{1 / 2} h)}_{K_{j}} ‖}_{2} \\ \geq \sqrt{n (1 - δ)} {‖ {(ρ^{1 / 2} h)}_{K} + {(ρ^{1 / 2} h)}_{K_{1}} ‖}_{2} - \sqrt{n (1 + δ)} \sum_{j = 2}^{d} {‖ {(ρ^{1 / 2} h)}_{K_{j}} ‖}_{2} \\ \geq \sqrt{n (1 - δ)} {‖ {(ρ^{1 / 2} h)}_{K} + {(ρ^{1 / 2} h)}_{K_{1}} ‖}_{2} - \sqrt{n (1 + δ)} (\frac{1}{2} {‖ {(ρ^{1 / 2} h)}_{K} ‖}_{2} + \frac{1}{2 \sqrt{\tilde{s}}} {‖ \nabla β_{0} - {(\nabla β_{0})}_{S} ‖}_{1}) \\ \geq (\sqrt{1 - δ} - \frac{\sqrt{1 + δ}}{2}) \sqrt{n} {‖ {(ρ^{1 / 2} h)}_{K} + {(ρ^{1 / 2} h)}_{K_{1}} ‖}_{2} - \frac{\sqrt{n (1 + δ)}}{2 \sqrt{\tilde{s}}} {‖ \nabla β_{0} - {(\nabla β_{0})}_{S} ‖}_{1} . \end{array}

Since δ < 1/3, we have

\sqrt{n} {‖ {(ρ^{1 / 2} h)}_{K} + {(ρ^{1 / 2} h)}_{K_{1}} ‖}_{2} ≲ 5 \sqrt{n} σ + \frac{3 \sqrt{n}}{\sqrt{\tilde{s}}} {‖ \nabla β_{0} - {(\nabla β_{0})}_{S} ‖}_{1} .

(19)

Further, it follows from (16) and (19) that

{‖ \sum_{j = 2}^{d} {(ρ^{1 / 2} h)}_{K_{j}} ‖}_{2} \leq \sum_{j = 2}^{d} {‖ {(ρ^{1 / 2} h)}_{K_{j}} ‖}_{2} \leq \frac{1}{2} {‖ {(ρ^{1 / 2} h)}_{K} + {(ρ^{1 / 2} h)}_{K_{1}} ‖}_{2} + \frac{\sqrt{n}}{2 \sqrt{\tilde{s}}} {‖ \nabla β_{0} - {(\nabla β_{0})}_{S} ‖}_{1} .

We arrive at

\sqrt{n} {‖ ρ^{1 / 2} h ‖}_{2} ≲ 8 \sqrt{n} σ + \frac{5 \sqrt{n}}{\sqrt{\tilde{s}}} {‖ \nabla β_{0} - {(\nabla β_{0})}_{S} ‖}_{1} ≲ \sqrt{n} σ + \frac{\sqrt{n}}{{(s log N)}^{q + \frac{1}{2}}} {‖ \nabla β_{0} - {(\nabla β_{0})}_{S} ‖}_{1} .

This gives

{‖ ρ^{1 / 2} h ‖}_{2} \leq C \left\{ σ + \frac{1}{{(s log N)}^{q + \frac{1}{2}}} {‖ \nabla β_{0} - {(\nabla β_{0})}_{S} ‖}_{1} \right\},

which provides the weighted L₂ error bound.

Finally, because

\begin{array}{l} \sqrt{n} σ ≳ \sqrt{n} {‖ \tilde{A} ρ^{1 / 2} h ‖}_{2} \geq \sqrt{n (1 - δ)} {‖ (ρ^{1 / 2} h) ‖}_{2} \\ \geq \sqrt{n (1 - δ)} {‖ {(ρ^{1 / 2} h)}_{K} ‖}_{2} \geq \sqrt{n (1 - δ)} K^{- q} {(\sum_{j = 1}^{K} {∣ h_{(j)} ∣}^{2})}^{1 / 2}, \end{array}

we have

{(\sum_{j = 1}^{K} {∣ h_{(j)} ∣}^{2})}^{1 / 2} ≲ {(s log N)}^{q} σ .

Combining this with (20) leads to the L₁ error bound since

\begin{array}{l} \sum_{j = 1}^{N^{2}} ∣ h_{(j)} ∣ \leq (1 + log (\frac{N^{2}}{s})) \sum_{j = 1}^{K} ∣ h_{(j)} ∣ + log (\frac{N^{2}}{s}) {‖ \nabla β_{0} - {(\nabla β_{0})}_{S} ‖}_{1} \\ \leq (1 + log (\frac{N^{2}}{s})) K^{1 / 2} {(\sum_{j = 1}^{K} {∣ h_{(j)} ∣}^{2})}^{1 / 2} + log (\frac{N^{2}}{s}) {‖ \nabla β_{0} - {(\nabla β_{0})}_{S} ‖}_{1} \\ ≲ (1 + log (\frac{N^{2}}{s})) K^{q + \frac{1}{2}} σ + log (\frac{N^{2}}{s}) {‖ \nabla β_{0} - {(\nabla β_{0})}_{S} ‖}_{1} \\ ≲ log (\frac{N^{2}}{s}) \left\{ {(s log N)}^{q + \frac{1}{2}} σ + {‖ \nabla β_{0} - {(\nabla β_{0})}_{S} ‖}_{1} \right\} . \end{array}

This completes the proof of the theorem.
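The restricted isometry property (18) used in this proof can be illustrated numerically. The sketch below draws a Gaussian matrix with entries n^{−1/2}ξ and reports the ratio ‖Ãu‖₂²/‖u‖₂² for randomly drawn s-sparse vectors u. It samples sparse vectors rather than checking all of them, so it is an illustration and not a verification of the RIP; the function name and the parameter values in the usage note are assumptions.

```python
import numpy as np


def rip_ratios(n, p, s, n_trials, rng):
    """Ratios ||A u||^2 / ||u||^2 for random s-sparse u and Gaussian A (n x p)."""
    A = rng.standard_normal((n, p)) / np.sqrt(n)   # entries n^{-1/2} * xi
    ratios = np.empty(n_trials)
    for t in range(n_trials):
        u = np.zeros(p)
        support = rng.choice(p, size=s, replace=False)  # random sparse support
        u[support] = rng.standard_normal(s)
        ratios[t] = np.sum((A @ u) ** 2) / np.sum(u ** 2)
    return ratios
```

For instance, `rip_ratios(400, 256, 5, 200, np.random.default_rng(0))` returns ratios that concentrate around 1, and the spread shrinks as n grows relative to s, in line with the sample-size condition stated before (18).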

Lemma 7.1

Let K = 8s log N. Then

\sum_{j = K + 1}^{N^{2}} ∣ h_{(j)} ∣ ≲ log (\frac{N^{2}}{s}) (\sum_{j = 1}^{K} ∣ h_{(j)} ∣ + {‖ \nabla β_{0} - {(\nabla β_{0})}_{S} ‖}_{1}),

(20)

\sum_{j = K + 1}^{N^{2}} ρ_{(j)}^{1 / 2} ∣ h_{(j)} ∣ ≲ s^{q} {(log N)}^{q} \sum_{j = 1}^{K} ρ_{(j)}^{1 / 2} ∣ h_{(j)} ∣ + {‖ \nabla β_{0} - {(\nabla β_{0})}_{S} ‖}_{1} .

(21)

Proof

Let α = β₀−β̂. We first derive a cone constraint for α. Let S̃ be the support of the largest s elements of ∇β₀. Observe that

{‖ \nabla \hat{β} ‖}_{1} \leq {‖ \nabla β_{0} ‖}_{1} = {‖ {(\nabla β_{0})}_{\tilde{S}} ‖}_{1} + {‖ {(\nabla β_{0})}_{{\tilde{S}}^{c}} ‖}_{1},

and on the other hand,

\begin{array}{l} {‖ \nabla \hat{β} ‖}_{1} = {‖ {(\nabla β_{0})}_{\tilde{S}} - {(\nabla α)}_{\tilde{S}} ‖}_{1} + {‖ {(\nabla β_{0})}_{{\tilde{S}}^{c}} - {(\nabla α)}_{{\tilde{S}}^{c}} ‖}_{1} \\ \geq {‖ {(\nabla β_{0})}_{\tilde{S}} ‖}_{1} - {‖ {(\nabla α)}_{\tilde{S}} ‖}_{1} - {‖ {(\nabla β_{0})}_{{\tilde{S}}^{c}} ‖}_{1} + {‖ {(\nabla α)}_{{\tilde{S}}^{c}} ‖}_{1} . \end{array}

Combining these two inequalities yields

{‖ {(\nabla α)}_{{\tilde{S}}^{c}} ‖}_{1} \leq {‖ {(\nabla α)}_{\tilde{S}} ‖}_{1} + 2 {‖ \nabla β_{0} - {(\nabla β_{0})}_{\tilde{S}} ‖}_{1} .

(22)

The cone constraint on the discrete gradient can be transferred to a cone constraint on the wavelet coefficients. Write

α = \sum_{j \in S} h_{j} ϕ_{j} + \sum_{j \in S^{c}} h_{j} ϕ_{j},

where the wavelets ϕ_j with j ∈ S are those that are non-constant over the support, and S has cardinality at most K = 8s log N. Recall that |h_{(j)}| ≤ C j^{−1} ‖∇α‖₁. From (22) we have

\begin{array}{l} \sum_{j = K + 1}^{N^{2}} ∣ h_{(j)} ∣ \leq \sum_{j = s + 1}^{N^{2}} ∣ h_{(j)} ∣ ≲ log (\frac{N^{2}}{s}) {‖ \nabla α ‖}_{1} \\ = log (\frac{N^{2}}{s}) ({‖ {(\nabla α)}_{\tilde{S}} ‖}_{1} + {‖ {(\nabla α)}_{{\tilde{S}}^{c}} ‖}_{1}) \\ ≲ log (\frac{N^{2}}{s}) (2 {‖ {(\nabla α)}_{\tilde{S}} ‖}_{1} + 2 {‖ \nabla β_{0} - {(\nabla β_{0})}_{\tilde{S}} ‖}_{1}) \\ ≲ log (\frac{N^{2}}{s}) (\sum_{j = 1}^{K} ∣ h_{(j)} ∣ + {‖ \nabla β_{0} - {(\nabla β_{0})}_{S} ‖}_{1}), \end{array}

where the last inequality holds because ||∇ϕ_j||₁ ≤ 8 (Needell and Ward, 2013), and

{‖ {(\nabla α)}_{\tilde{S}} ‖}_{1} = {‖ \nabla (\sum_{j \in \tilde{S}} h_{(j)} ϕ_{(j)}) ‖}_{1} \leq \sum_{j \in \tilde{S}} ∣ h_{(j)} ∣ {‖ \nabla ϕ_{(j)} ‖}_{1} \leq 8 \sum_{j \in \tilde{S}} ∣ h_{(j)} ∣ \leq 8 \sum_{j = 1}^{K} ∣ h_{(j)} ∣ .

Furthermore, since ρ_{(j)} is of order j^{−2q} for q > 0 from A3, we have

\begin{array}{l} \sum_{j = K + 1}^{N^{2}} ρ_{(j)}^{1 / 2} ∣ h_{(j)} ∣ ≲ \sum_{j = K + 1}^{N^{2}} j^{- q} j^{- 1} {‖ \nabla α ‖}_{1} \leq {‖ \nabla α ‖}_{1} \\ ≲ \sum_{j = 1}^{K} ∣ h_{(j)} ∣ + {‖ \nabla β_{0} - {(\nabla β_{0})}_{S} ‖}_{1} \\ ≲ K^{q} \sum_{j = 1}^{K} ρ_{(j)}^{1 / 2} ∣ h_{(j)} ∣ + {‖ \nabla β_{0} - {(\nabla β_{0})}_{S} ‖}_{1} \\ = s^{q} {(log N)}^{q} \sum_{j = 1}^{K} ρ_{(j)}^{1 / 2} ∣ h_{(j)} ∣ + {‖ \nabla β_{0} - {(\nabla β_{0})}_{S} ‖}_{1}, \end{array}

(23)

which completes the proof of the lemma.
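The bound ‖∇ϕ_j‖₁ ≤ 8 used in the proof can be checked numerically on a small grid. The sketch below builds the non-constant bivariate Haar wavelets as unit-norm functions supported on dyadic blocks, with three sign patterns per block, and computes the ℓ₁ norm of their discrete gradient using forward differences. This construction and the function names are assumptions for illustration and may differ in normalization or boundary handling from the one in Needell and Ward (2013).

```python
import numpy as np


def bivariate_haar_wavelets(N):
    """Yield the N^2 - 1 non-constant bivariate Haar wavelets on an N x N grid."""
    L = 2                                   # side length of the dyadic block
    while L <= N:
        s = np.ones(L)
        s[L // 2:] = -1.0                   # 1-D sign pattern: + then -
        one = np.ones(L)
        patterns = [np.outer(one, s), np.outer(s, one), np.outer(s, s)]
        for i in range(0, N, L):
            for j in range(0, N, L):
                for pat in patterns:
                    w = np.zeros((N, N))
                    w[i:i + L, j:j + L] = pat / L   # amplitude 1/L gives unit l2 norm
                    yield w
        L *= 2


def grad_l1(img):
    """l1 norm of the discrete gradient (forward differences along both axes)."""
    return float(np.abs(np.diff(img, axis=0)).sum() + np.abs(np.diff(img, axis=1)).sum())
```

On an 8 × 8 grid the largest value of `grad_l1` over all wavelets is attained by a checkerboard-type wavelet on an interior 2 × 2 block, where the four inner jumps and the eight boundary jumps each contribute 4.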

7.2 Proof of Theorem 3.1

Write ξ = (θ, β), W = (X, Z), and W_i = (X_i, Z_i), i = 1, …, n. Denote η_i(ξ) = η(W_i; ξ), η_W(ξ) = η(W; ξ), and

M_{n} (ξ) = - \frac{1}{n} \sum_{i = 1}^{n} \{ Y_{i} η_{i} (ξ) - b (η_{i} (ξ)) \}, {\bar{M}}_{n} (ξ) = - \frac{1}{n} \sum_{i = 1}^{n} \{ \dot{b} (η_{i} (ξ)) η_{i} (ξ) - b (η_{i} (ξ)) \} .

Recall that ξ̂ minimizes −M_n(ξ) + λ||β||_TV. We have

0 \leq M_{n} (\hat{ξ}) - M_{n} (ξ_{0}) - λ {‖ \hat{β} ‖}_{T V} + λ {‖ β_{0} ‖}_{T V} .

Direct calculation yields

\begin{array}{l} M_{n} (ξ) - M_{n} (ξ_{0}) = - \frac{1}{n} \sum_{i = 1}^{n} (Y_{i} - \dot{b} (η_{i} (ξ_{0}))) η_{i} (ξ - ξ_{0}) + \dot{b} (η_{i} (ξ_{0})) η_{i} (ξ - ξ_{0}) - (b (η_{i} (ξ)) - b (η_{i} (ξ_{0}))) \\ = H_{n} (\hat{ξ}) - \frac{1}{2 n} \sum_{i = 1}^{n} \ddot{b} (η_{i}^{*}) η_{i}^{2} (ξ - ξ_{0}) . \end{array}

(24)

where η_i(ξ − ξ₀) = (θ − θ₀)^T Z_i + 〈X_i, β − β₀〉, and

H_{n} (ξ) = - \frac{1}{n} \sum_{i = 1}^{n} (Y_{i} - \dot{b} (η_{i} (ξ_{0}))) η_{i} (ξ - ξ_{0}),

(25)

and $η_{i}^{*}$ is a number between η_i(ξ) and η_i(ξ₀). Therefore,

\begin{array}{l} \frac{1}{2 n} \sum_{i = 1}^{n} \ddot{b} (η_{i}^{*}) η_{i}^{2} (ξ - ξ_{0}) \leq H_{n} (\hat{ξ}) - λ {‖ \hat{β} ‖}_{T V} + λ {‖ β_{0} ‖}_{T V} \\ \leq sup_{ξ} ∣ H_{n} (ξ) ∣ - λ {‖ \hat{β} ‖}_{T V} + λ {‖ β_{0} ‖}_{T V} \\ ≲ n^{- 1 / 2} + λ, \end{array}

where the last inequality is because of Lemma 7.2.

Let $g^{*} = {(g_{1}^{*} (X_{1}, \dots, X_{n}), \dots, g_{n}^{*} (X_{1}, \dots, X_{n}))}^{T}$ be the least favorable direction such that, for any g = (g₁(X₁,…, X_n), …, g_n(X₁, …, X_n))^T, we have

\frac{1}{n} \sum_{i = 1}^{n} {(Z_{i} - g_{i}^{*} (X_{1}, \dots, X_{n}))}^{T} g_{i} (X_{1}, \dots, X_{n}) = 0.

Note that

\begin{array}{l} \frac{1}{n} \sum_{i = 1}^{n} η_{i}^{2} (\hat{ξ} - ξ_{0}) = \frac{1}{n} \sum_{i = 1}^{n} {({(\hat{θ} - θ_{0})}^{T} Z_{i} + 〈 X_{i}, \hat{β} - β_{0} 〉)}^{2} \\ = \frac{1}{n} \sum_{i = 1}^{n} {({(\hat{θ} - θ_{0})}^{T} (Z_{i} - g_{i}^{*}) + {(\hat{θ} - θ_{0})}^{T} g_{i}^{*} + 〈 X_{i}, \hat{β} - β_{0} 〉)}^{2} \\ = {(\hat{θ} - θ_{0})}^{T} (\frac{1}{n} \sum_{i = 1}^{n} (Z_{i} - g_{i}^{*}) {(Z_{i} - g_{i}^{*})}^{T}) (\hat{θ} - θ_{0}) + \frac{1}{n} \sum_{i = 1}^{n} {({(\hat{θ} - θ_{0})}^{T} g_{i}^{*} + 〈 X_{i}, \hat{β} - β_{0} 〉)}^{2} . \end{array}

Assuming that $n^{- 1} \sum_{i = 1}^{n} (Z_{i} - g_{i}^{*}) {(Z_{i} - g_{i}^{*})}^{T}$ is non-singular, we conclude that ‖θ̂ − θ₀‖₂ ≲ n^{−1/2} by choosing λ = Cn^{−1/2}, and ${‖ A_{X} \hat{β} - A_{X} β_{0} ‖}_{2} ≲ \sqrt{n}$, which gives an inequality similar to (17). The remainder of the proof follows the same arguments as the proof of Theorem 2.1. This completes the proof of Theorem 3.1.

Lemma 7.2

Assume B1–B2 hold. Let r = q − 1/2 > 0. Let B_δ = {ξ: δ/2 ≤ d_W(ξ, ξ₀) ≤ δ}, where

d_{W}^{2} (ξ, ξ_{0}) = {‖ θ - θ_{0} ‖}_{2}^{2} + E_{X} ({〈 X, β - β_{0} 〉}^{2})

Then,

sup_{ξ \in B_{δ}} | H_{n} (ξ) | = O_{p} (n^{- \frac{1}{2}} δ^{\frac{2 r + 1}{2 (r + 1)}}) .

(26)

Proof

Recall that H_n(ξ) = − (P_n − ℙ) f_ξ(W, Y), where

f_{ξ} (W, Y) = (Y - \dot{b} (η_{W} (ξ_{0}))) η_{W} (ξ - ξ_{0}) .

Consider ℳ_δ = {f_ξ(W, Y) : ξ ∈ B_δ}, with L₂(P) norm, i.e., for any f ∈ ℳ_δ, ${‖ f ‖}_{P, 2}^{2} = E_{W, Y} f^{2} (W, Y)$ .

Let $G_{n} = \sqrt{n} (P_{n} - ℙ)$ and ‖G_n‖_{ℳ_δ} = sup_{f ∈ ℳ_δ} |G_n f|. Then,

sup_{ξ \in B_{δ}} | H_{n} (ξ) | = n^{- 1 / 2} {‖ G_{n} ‖}_{M_{δ}} .

Therefore, it suffices to show that

{‖ {‖ G_{n} ‖}_{M_{δ}} ‖}_{P, 2} = O (δ^{\frac{2 r + 1}{2 (r + 1)}}) .

We prove this result by using Theorem 2.14.1 of van der Vaart and Wellner (1996) and exploiting the covering numbers of ℳ_δ. For statistical applications of covering numbers, please see Chapter 2 of van der Vaart and Wellner (1996). This result can be achieved by showing that

log N (ε, M_{δ}, {‖ \cdot ‖}_{P, 2}) ≲ ε^{- \frac{1}{r + 1}} log (\frac{4 δ + ε}{ε}) .

Suppose that there exist ξ₁, …, ξ_m ∈ B_δ such that, for any ξ ∈ B_δ, min_{1 ≤ i ≤ m} ‖ξ − ξ_i‖_{P,2} ≤ ε. Observe that

min_{1 \leq i \leq m} E_{W, Y} {[(Y - \dot{b} (η_{W} (ξ_{0}))) η_{W} (ξ - ξ_{0}) - (Y - \dot{b} (η_{W} (ξ_{0}))) η_{W} (ξ_{i} - ξ_{0})]}^{2} \leq C ε^{2} .

Therefore, the covering number for ℳ_δ is of the same order as that for B_δ, and specifically,

N (ε, M_{δ}, {‖ \cdot ‖}_{P, 2}) = N (\frac{ε}{C}, B_{δ}, d_{W}) .

Write β = Σ_k γ_kϕ_k, $β_{0} = \sum_{k} γ_{k}^{0} ϕ_{k}$, and $X = \sum_{k} ρ_{k}^{1 / 2} ξ_{k} ϕ_{k}$. Hence, $d_{W}^{2} (ξ, ξ_{0}) = {‖ θ - θ_{0} ‖}_{2}^{2} + \sum_{k} ρ_{k} {(γ_{k} - γ_{k}^{0})}^{2}$. For any β = Σ_k γ_kϕ_k ∈ B_δ, let β^* = Σ_{k ≤ M} γ_{(k)}ϕ_{(k)} with M = ε^{−1/(r+1)}. Since ρ_{(l)} is of order l^{−2q}, Σ_{ℓ > s} ρ_{(ℓ)} is of order s^{−2r} with r = q − 1/2 > 0. Therefore,

E ({〈 X, β - β^{*} 〉}^{2}) = \sum_{k > M} γ_{(k)}^{2} ρ_{(k)} ≲ \frac{1}{M^{2}} M^{- 2 r} = M^{- 2 (r + 1)} = ε^{2} .

So if we can find $β_{1}^{*}, \dots, β_{m}^{*} \in B_{δ}^{*}$, where $B_{δ}^{*} = \{ β = \sum_{k \leq M} γ_{(k)} ϕ_{(k)} : \sum_{k \leq M} ρ_{(k)} {(γ_{(k)} - γ_{(k)}^{0})}^{2} \leq δ^{2} \}$, satisfying for all $β^{*} \in B_{δ}^{*}$,

min_{1 \leq k \leq m} E ({〈 X, β^{*} - β_{k}^{*} 〉}^{2}) \leq ε^{2},

it also guarantees that, for any β ∈ B_δ,

min_{1 \leq k \leq m} E ({〈 X, β - β_{k}^{*} 〉}^{2}) ≲ min_{1 \leq k \leq m} \{ E ({〈 X, β - β^{*} 〉}^{2}) + E ({〈 X, β^{*} - β_{k}^{*} 〉}^{2}) \} ≲ ε^{2} .

In addition, since θ ∈ Θ which is a bounded subset of ℝ^p, we may find θ₁, …, θ_m̃ such that, for any θ ∈ Θ, ${min}_{1 \leq k \leq \tilde{m}} {‖ θ - θ_{k} ‖}_{2}^{2} \leq ε^{2}$ .

Since it is known that the covering number for a ball in ℝ^M is $N (ε, B_{δ}^{*}, d_{X}) ≲ {(4 δ + ε) / ε}^{M}$ , it follows from the above arguments that

log N (ε, M_{δ}, {‖ \cdot ‖}_{P, 2}) ≲ (M + p) log (\frac{4 δ + ε}{ε}) ≲ ε^{- \frac{1}{r + 1}} log (\frac{4 δ + ε}{ε}) .

We calculate J(1, ℳ_δ) by

J (1, M_{δ}) = \int_{0}^{1} \sqrt{1 + log N (ε, M_{δ}, {‖ \cdot ‖}_{P, 2})} d ε ≲ δ^{\frac{2 r + 1}{2 (r + 1)}} .

By Theorem 2.14.1 in van der Vaart and Wellner (1996), we have ${‖ {‖ G_{n} ‖}_{M_{δ}} ‖}_{P, 2} = O (δ^{\frac{2 r + 1}{2 (r + 1)}})$, which completes the proof of the lemma.

Footnotes

Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.

Contributor Information

Xiao Wang, Associate Professor of Statistics, Department of Statistics, Purdue University, West Lafayette, IN 47907.

Hongtu Zhu, Professor of Biostatistics, Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX 77230, and University of North Carolina, Chapel Hill, NC 27599.

References

1. Candès E, Romberg J, Tao T. Stable signal recovery from incomplete and inaccurate measurements. Comm Pure App Math. 2006a;59:1207–1223.
2. Candès E, Romberg J, Tao T. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans Inform Theory. 2006b;52:489–509.
3. Chen S, Donoho DL, Saunders M. Atomic decomposition for basis pursuit. SIAM J Sci Comp. 1998;20:33–61.
4. Colom R, Stein JL, Rajagopalan P, Martínez K, Hermel D, Wang Y, Alvarez Linera J, Burgaleta M, Quiroga AM, Shih PC, Thompson PM. Hippocampal structure and human cognition: key role of spatial processing and evidence supporting the efficiency hypothesis in females. Intelligence. 2013;41:129–140. doi: 10.1016/j.intell.2013.01.002.
5. Crambes C, Kneip A, Sarda P. Smoothing splines estimators for functional linear regression. Annals of Statistics. 2009;37:35–72.
6. De La Torre JC. Alzheimer’s disease is incurable but preventable. Journal of Alzheimer’s Disease. 2010;20:861–870. doi: 10.3233/JAD-2010-091579.
7. Du P, Wang X. Penalized likelihood functional regression. Statistica Sinica. 2014;24:1017–1041.
8. Efron B. How biased is the apparent error rate of a prediction rule? J Amer Statist Assoc. 1986;81:461–470.
9. Efron B. The estimation of prediction error: Covariance penalties and cross-validation (with discussion). J Amer Statist Assoc. 2004;99:619–642.
10. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–1360.
11. Fennema-Notestine C, Hagler DJ Jr, McEvoy LK, Fleisher AS, Wu EH, Karow DS, Dale AM, Alzheimer’s Disease Neuroimaging Initiative. Structural MRI biomarkers for preclinical and mild Alzheimer’s disease. Human Brain Mapping. 2009;30:3238–3253. doi: 10.1002/hbm.20744.
12. Ferraty F, Vieu P. Nonparametric Functional Data Analysis: Theory and Practice. Springer-Verlag Inc; New York: 2006.
13. Friedman J, Hastie T, Höfling H, Tibshirani R. Pathwise coordinate optimization. Annals of Applied Statistics. 2007;1:302–332.
14. Gertheiss J, Maity A, Staicu AM. Variable selection in generalized functional linear model. Stat. 2013;2:86–101. doi: 10.1002/sta4.20.
15. Goldsmith J, Bobb J, Crainiceanu CM, Caffo B, Reich D. Penalized functional regression. Journal of Computational and Graphical Statistics. 2010;20:830–851. doi: 10.1198/jcgs.2010.10007.
16. Goldsmith J, Huang L, Crainiceanu CM. Smooth scalar-on-image regression via spatial Bayesian variable selection. Journal of Computational and Graphical Statistics. 2014;23:46–64. doi: 10.1080/10618600.2012.743437.
17. Guillas S, Lai MJ. Bivariate splines for spatial functional regression models. Journal of Nonparametric Statistics. 2010;22:477–497.
18. Hall P, Horowitz JL. Methodology and convergence rates for functional linear regression. Annals of Statistics. 2007;35:70–91.
19. James GM. Generalized linear models with functional predictors. Journal of the Royal Statistical Society, Series B. 2002;64:411–432.
20. James GM, Wang J, Zhu J. Functional linear regression that’s interpretable. Annals of Statistics. 2009;37:2083–2108.
21. Hestenes MR. Multiplier and gradient methods, Journal of Optimization Theory and Applications. In: Zadeh LA, Neustadt LW, Balakrishnan AV, editors. Computing Methods in Optimization Problems. Vol. 4. Academic Press; New York: 1969. pp. 303–320.
22. Li Y, Wang N, Carroll RJ. Generalized functional linear models with semiparametric single-index interactions. Journal of the American Statistical Association. 2010;105:621–633. doi: 10.1198/jasa.2010.tm09313.
23. Li C. Compressive Sensing for 3D Data Processing Tasks: Applications, Model and Algorithms. PhD Dissertation. Rice University; 2013.
24. Lorensen WE, Cline HE. Marching cubes: a high resolution 3D surface construction algorithm. Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 87; 1987. pp. 163–169.
25. Luders E, Thompson PM, Kurth F, Hong JY, Phillips OR, Wang Y, Gutman BA, Chou YY, Narr KL, Toga AW. Global and regional alterations of hippocampal anatomy in long-term meditation practitioners. Hum Brain Mapp. 2013;34:3369–3375. doi: 10.1002/hbm.22153.
26. Mammen E, van de Geer S. Locally adaptive regression splines. Annals of Statistics. 1997;25:387–413.
27. Michel V, Gramfort A, Varoquaux G, Eger E, Thirion B. Total variation regularization for fMRI-based prediction of behavior. IEEE Transactions on Medical Imaging. 2011;30:1328–1340. doi: 10.1109/TMI.2011.2113378.
28. Mu Y, Gage F. Adult hippocampal neurogenesis and its role in Alzheimer’s disease. Molecular Neurodegeneration. 2011;6:85. doi: 10.1186/1750-1326-6-85.
29. Müller HG, Stadtmüller U. Generalized functional linear models. Annals of Statistics. 2005;33:774–805.
30. Needell D, Ward R. Stable image reconstruction using total variation minimization. SIAM J Imaging Sciences. 2013;6:1035–1058.
31. Nelder J, Wedderburn R. Generalized linear models. Journal of the Royal Statistical Society, Series A. 1972;135:370–384.
32. Patenaude B, Smith SM, Kennedy DN, Jenkinson M. A Bayesian model of shape and appearance for subcortical brain segmentation. NeuroImage. 2011;56:907–922. doi: 10.1016/j.neuroimage.2011.02.046.
33. Petrushev PP, Cohen A, Xu H, DeVore R. Nonlinear approximation and the space BV(R²). American Journal of Mathematics. 1999;121:587–628.
34. Powell MJD. A Method for Nonlinear Constraints in Minimization Problems. In: Fletcher R, editor. Optimization. Academic Press; London, New York: 1969. pp. 283–298.
35. Ramsay JO, Silverman BW. Functional Data Analysis. Springer-Verlag Inc; New York: 2005.
36. Reiss PT, Huo L, Zhao Y, Kelly C, Ogden RT. Wavelet-domain regression and predictive inference in psychiatric neuroimaging. Annals of Applied Statistics. 2015;9(2):1076–1101. doi: 10.1214/15-AOAS829.
37. Reiss PT, Ogden RT. Functional principal component regression and functional partial least squares. Journal of the American Statistical Association. 2007;102:984–996.
38. Reiss PT, Ogden RT. Functional generalized linear models with images as predictors. Biometrics. 2010;66:61–69. doi: 10.1111/j.1541-0420.2009.01233.x.
39. Rudin LI, Osher S. Total variation based image restoration with free local constraints. Proc 1st IEEE ICIP. 1994;1:31–35.
40. Rudin LI, Osher S, Fatemi E. Nonlinear total variation noise removal algorithm. Physica D. 1992;60:259–268.
41. Shi J, Lepore N, Gutman B, Thompson PM, Baxter L, Caselli RJ, Wang Y. Genetic influence of APOE4 genotype on hippocampal morphometry - an N=725 surface-based ADNI Study. Hum Brain Mapp. 2014;35:3902–3918. doi: 10.1002/hbm.22447.
42. Shi J, Thompson PM, Gutman B, Wang Y. Surface fluid registration of conformal representation: application to detect disease burden and genetic influence on hippocampus. NeuroImage. 2013;78:111–134. doi: 10.1016/j.neuroimage.2013.04.018.
43. Tibshirani R. Regression shrinkage and selection via the Lasso. J R Statist Soc B. 1996;58:267–288.
44. Tibshirani R. Adaptive piecewise polynomial estimation via trend filtering. Annals of Statistics. 2014;42:285–323.
45. Tibshirani R, Taylor J. Degrees of freedom in Lasso problems. Annals of Statistics. 2012;40:1198–1232.
46. Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K. Sparsity and smoothness via the fused lasso. J R Statist Soc B. 2005;67:91–108.
47. van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer; New York: 1996.
48. Vidakovic B. Statistical Modeling by Wavelets. Wiley; New York: 1999.
49. Wang X, Nan B, Zhu J, Koeppe R, ADNI. Regularized 3D functional regression for brain image data via Haar wavelets. Annals of Applied Statistics. 2014;8:1045–1064. doi: 10.1214/14-AOAS736.
50. Wang Y, Zhang J, Gutman B, Chan TF, Becker JT, Aizenstein HJ, Lopez OL, Tamburo RJ, Toga AW, Thompson PM. Multivariate tensor based morphometry on surfaces: Application to mapping ventricular abnormalities in HIV/AIDS. NeuroImage. 2010;49:2141–2157. doi: 10.1016/j.neuroimage.2009.10.086.
51. Weiner MW, Veitch DP, Aisen PS, Beckett LA, Cairns NJ, Green RC, Harvey D, Jack CR, Jagust W, Liu E, Morris JC, Petersen RC, Saykin AJ, Schmidt ME, Shaw L, Siuciak JA, Soares H, Toga AW, Trojanowski JQ, ADNI. The Alzheimer’s Disease Neuroimaging Initiative: A review of papers published since its inception. Alzheimers Dement. 2012;8:S1–S68. doi: 10.1016/j.jalz.2011.09.172.
52. Yuan M, Cai TT. A reproducing kernel Hilbert space approach to functional linear regression. Annals of Statistics. 2010;38:3412–3444.
53. Zhao Y, Ogden RT, Reiss PT. Wavelet-based LASSO in functional linear regression. Journal of Computational and Graphical Statistics. 2014;21:600–617. doi: 10.1080/10618600.2012.679241.
54. Zhou H, Li L. Regularized matrix regression. Journal of the Royal Statistical Society, Series B. 2014;76:463–483. doi: 10.1111/rssb.12031.
55. Zhou H, Li L, Zhu H. Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association. 2013;108:540–552. doi: 10.1080/01621459.2013.776499.
56. Zhu H, Fan J, Kong L. Spatially varying coefficient model for neuroimaging data with jump discontinuities. Journal of the American Statistical Association. 2014;109:977–990. doi: 10.1080/01621459.2014.881742.
