Abstract
We consider the problem of using high-dimensional data residing on graphs to predict a low-dimensional outcome variable, such as disease status. Examples include time series and genetic data measured on linear graphs and imaging data measured on triangulated graphs (or lattices), among many others. Many of these data share two key features: spatial smoothness and an intrinsically low-dimensional structure. We propose a simple solution based on a general statistical framework, called spatially weighted principal component regression (SWPCR). In SWPCR, we introduce two sets of weights: importance score weights for the selection of individual features at each node and spatial weights for the incorporation of the neighboring pattern on the graph. We integrate the importance score weights with the spatial weights in order to recover the low-dimensional structure of high-dimensional data. We demonstrate the utility of our methods through extensive simulations and a real data analysis based on Alzheimer’s disease neuroimaging initiative data.
Keywords: Graph, Principal component analysis, Regression, Spatial, Supervised, Weight
1 Introduction
Our problem of interest is to predict a set of response variables Y by using high-dimensional data x = {xg : g ∈ 𝒱} measured on a graph ζ = (𝒱, ℰ), where ℰ is the edge set of ζ and 𝒱 = {g1, …, gm} is a set of vertexes, in which m is the total number of vertexes in 𝒱. The response Y may include cognitive outcome, disease status, and the early onset of disease, among others. Standard graphs, including both directed and undirected graphs, have been widely used to represent complex patterns [10]. Examples of graphs are linear graphs, tree graphs, triangulated graphs, and 2-dimensional (2D) (or 3-dimensional (3D)) lattices, among many others (Figure 1). Examples of x on the graph ζ = (𝒱, ℰ) include time series and genetic data measured on linear graphs and imaging data measured on triangulated graphs (or lattices). In particular, various structural and functional neuroimaging data are frequently measured on a 3D lattice for the understanding of brain structure and function and their association with neuropsychiatric and neurodegenerative disorders [9].
The aim of this paper is to develop a new framework of spatially weighted principal component regression (SWPCR) that uses x on the graph ζ = (𝒱, ℰ) to predict Y. Four major challenges arise in such a development: ultra-high dimensionality, low sample size, spatial correlation, and spatial smoothness. SWPCR is developed to address these four challenges when high-dimensional data on the graph ζ share two important features: spatial smoothness and an intrinsically low-dimensional structure. Compared with the existing literature, we make several major contributions as follows:
(i) SWPCR is designed to efficiently capture the two important features by using recent advances in smoothing methods, dimension reduction methods, and sparse methods.
(ii) SWPCR provides a powerful dimension reduction framework for integrating feature selection, smoothing, and feature extraction.
(iii) SWPCR significantly outperforms competing methods in simulation studies and a real data analysis.
2 Spatially Weighted Principal Component Regression
In this section, we first describe the graph data considered in this paper and then formally describe the general framework of SWPCR.
2.1 Graph Data
Consider data from n independent subjects. For each subject, we observe a q × 1 vector of discrete or continuous responses, denoted by yi = (yi,1, …, yi,q)T, and an m × 1 vector of high-dimensional data xi = {xi,g : g ∈ 𝒱} for i = 1, …, n. In many cases, q is relatively small compared with n, whereas m is much larger than n. For instance, in many neuroimaging studies, it is common to use ultra-high dimensional imaging data to classify a binary class variable. In this case, q = 1, whereas m can be several million features. In many applications, 𝒱 = {g1, …, gm} is a set of prefixed vertexes, such as voxels in 2D or 3D lattices, whereas the edge set ℰ may be either prefixed or determined by xi (or other data).
2.2 SWPCR
We introduce a three-stage algorithm for SWPCR to use high-dimensional data x to predict a set of response variables Y. The key stages of SWPCR can be described as follows.
Stage 1. Build an importance score vector (or function) WI: 𝒱 → R+ and a spatial weight matrix (or function) WE: 𝒱 × 𝒱 → R.
Stage 2. Build a sequence of scale vectors {s0 = (sE,0, sI,0), ···, sL = (sE,L, sI,L)} ranging from the smallest scale vector s0 to the largest scale vector sL. At each scale vector sℓ, use generalized principal component analysis (GPCA) to compute the first few principal components of an n × m matrix X = (x1 ··· xn)T, denoted by A(sℓ), based on WE(·, ·) and WI(·) for ℓ = 0, …, L.
Stage 3. Select the optimal 0 ≤ ℓ* ≤ L and build a prediction model (e.g., high-dimensional linear model) based on the extracted principal components A(sℓ*) and the responses Y.
We now elaborate on these stages. In Stage 1, the importance scores wI,g play an important feature screening role in SWPCR. In the literature, wI,g = WI(g) can be generated from statistics (e.g., the Pearson correlation or distance correlation) between xg and Y at each vertex g. For instance, let p(g) be the Pearson correlation at vertex g and define
wI,g = WI(g) = |p(g)|. (1)
In Stage 1, without loss of generality, we focus on the symmetric matrix WE = (wE,gg′) ∈ Rm×m throughout the paper. The element wE,gg′ is usually calculated by using various similarity criteria, such as Gaussian similarity based on Euclidean distance, local neighborhood relationships, correlation, and prior information obtained from other data [21]. In Section 2.3, we discuss how to determine WE and WI while explicitly accounting for the complex spatial structure among different vertexes.
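To make the two sets of weights concrete, the following sketch computes a correlation-based importance score for each vertex and a Gaussian similarity matrix from vertex coordinates. It is a minimal illustration in Python/NumPy: the absolute Pearson correlation and the Gaussian kernel bandwidth sigma are our own choices and only stand in for the transform in (1) and the similarity criteria cited above.

```python
import numpy as np

def importance_scores(X, y):
    """Absolute Pearson correlation between each column of X (n x m) and y (n,)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    return np.abs(num / den)                      # w_I, shape (m,)

def gaussian_spatial_weights(coords, sigma=1.0):
    """Gaussian similarity from Euclidean distance between vertex coordinates (m x d)."""
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))       # W_E, shape (m, m)
```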
In Stage 2, at each scale vector sℓ = (sE,ℓ, sI,ℓ), we construct two matrices, denoted by QE,ℓ and QI,ℓ based on WE and WI as follows:
QE,ℓ = F1(WE, sE,ℓ) and QI,ℓ = diag{F2(WI, sI,ℓ)}, (2)
where F1 : Rm×m × R+ → Rm×m and F2 : Rm × R+ → Rm are two known functions. For instance, letting 1(·) be an indicator function, we may set
F2(WI, sI,ℓ) = (wI,g 1(wI,g ≥ sI,ℓ) : g ∈ 𝒱) (3)
to extract ‘significant’ vertexes. There are various ways of constructing QE,ℓ. For instance, one may set QE,ℓ as

QE,ℓ = (wE,gg′ 1(|wE,gg′| ≥ sE,ℓ;1) 1(D(g, g′) ≤ sE,ℓ;2))g,g′∈𝒱,
where sE,ℓ = (sE,ℓ;1, sE,ℓ;2)T and D(g, g′) is a graph-based distance between vertexes g and g′. The value of sE,ℓ;2 controls the number of vertexes in {g′ ∈ 𝒱 : D(g, g′) ≤ sE,ℓ;2}, which forms a patch set at vertex g [18], whereas sE,ℓ;1 is used to shrink small |wE,gg′|s to zero.
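As an illustration of (2)–(3) and the patch-based construction above, the following sketch builds QI,ℓ as a diagonal selection matrix and QE,ℓ by hard-thresholding WE within a distance-sE,ℓ;2 patch. The diagonal form of QI,ℓ and the use of Euclidean distance between vertex coordinates as a stand-in for the graph distance D(g, g′) are assumptions.

```python
import numpy as np

def build_Q_I(w_I, s_I):
    """Diagonal selection matrix: keep w_I[g] when it reaches the threshold s_I (one choice of F2 in (3))."""
    return np.diag(np.where(w_I >= s_I, w_I, 0.0))

def build_Q_E(W_E, coords, s1, s2):
    """Threshold small |w_E| entries to zero (s1) and restrict to the patch D(g, g') <= s2."""
    D = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1))  # distance proxy for D(g, g')
    return np.where((np.abs(W_E) >= s1) & (D <= s2), W_E, 0.0)
```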
After determining QE,ℓ and QI,ℓ, we combine them into the column transformation QE,ℓQI,ℓ and set Σr = In for independent subjects. Let X̃ be the centered matrix of X. Then we can extract K principal components by minimizing the following objective function:
‖Σr−1/2 (X̃ QE,ℓ QI,ℓ − Uℓ Dℓ VℓT)‖F² subject to UℓT Uℓ = VℓT Vℓ = IK. (4)
If we consider correlated observations from multiple subjects, we may use Σr to explicitly model their correlation structure. The solution (Uℓ, Dℓ, Vℓ) of the objective function (4) at sℓ is the SVD of X̃R,ℓ = X̃QE,ℓQI,ℓ. Then we can use a GPCA algorithm to simultaneously calculate all components of (Uℓ, Dℓ, Vℓ) for a fixed K. In practice, a simple criterion for determining K is to include all components up to some arbitrary proportion of the total variance, say 85%.
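With Σr = In, the GPCA step reduces to an SVD of the weighted, centered data matrix. The sketch below follows that route and picks K by the cumulative-variance rule mentioned above; the 85% default reflects the arbitrary proportion quoted in the text.

```python
import numpy as np

def gpca_components(X, Q_E, Q_I, var_prop=0.85):
    """Principal components of the weighted, centered data; K chosen by cumulative variance."""
    Xc = X - X.mean(axis=0)             # centered data matrix X~ (n x m)
    XR = Xc @ Q_E @ Q_I                 # X~_R = X~ Q_E Q_I
    U, d, Vt = np.linalg.svd(XR, full_matrices=False)
    cum = np.cumsum(d ** 2) / np.sum(d ** 2)
    K = int(np.searchsorted(cum, var_prop) + 1)
    return U[:, :K] * d[:K], Vt[:K].T   # scores A (n x K) and loadings V (m x K)
```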
For ultra-high dimensional data, we consider a regularized GPCA to generate (Uℓ, Dℓ, Vℓ) by minimizing the following objective function
‖X̃R,ℓ − Σk=1K dk,ℓ uk,ℓ vk,ℓT‖F² + Σk=1K {P1(uk,ℓ) + P2(vk,ℓ)} (5)
subject to uk,ℓT uk,ℓ = 1 and vk,ℓT vk,ℓ = 1 for all k, where uk,ℓ and vk,ℓ are respectively the k-th columns of Uℓ and Vℓ. We use adaptive Lasso penalties for P1(·) and P2(·) and then iteratively solve (5) [1]. For each k0, we define the residual matrix X̃R,ℓ(k0) = X̃R,ℓ − Σk=1k0−1 d̂k,ℓ ûk,ℓ v̂k,ℓT and minimize
‖X̃R,ℓ(k0) − dk0,ℓ uk0,ℓ vk0,ℓT‖F² + P1(uk0,ℓ) + P2(vk0,ℓ) (6)
subject to uk0,ℓT uk0,ℓ = 1 and vk0,ℓT vk0,ℓ = 1. By using the sparse method in [12], we can calculate the solution of (6), denoted by (d̂k0,ℓ, ûk0,ℓ, v̂k0,ℓ). In this way, we can sequentially compute (d̂k,ℓ, ûk,ℓ, v̂k,ℓ) for k = 1, …, K.
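A rough sketch of this sequential rank-one extraction is given below, using plain soft-thresholding in place of the adaptive Lasso penalties; the update scheme mimics sparse SVD approaches such as [12], but the penalty parameters lam_u and lam_v and the fixed iteration count are illustrative choices, not the paper's exact algorithm.

```python
import numpy as np

def soft(z, lam):
    """Soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def sparse_rank_one(R, lam_u=0.0, lam_v=0.1, n_iter=100):
    """One sparse rank-one layer of the residual matrix R by alternating soft-thresholding."""
    U, d, Vt = np.linalg.svd(R, full_matrices=False)
    u, v = U[:, 0], Vt[0]                      # SVD warm start
    for _ in range(n_iter):
        u = soft(R @ v, lam_u);  u /= (np.linalg.norm(u) + 1e-12)
        v = soft(R.T @ u, lam_v); v /= (np.linalg.norm(v) + 1e-12)
    return float(u @ R @ v), u, v

def sparse_gpca(XR, K, lam_v=0.1):
    """Sequentially extract K sparse components, deflating the residual as in (6)."""
    R, comps = XR.copy(), []
    for _ in range(K):
        d, u, v = sparse_rank_one(R, lam_v=lam_v)
        comps.append((d, u, v))
        R = R - d * np.outer(u, v)             # deflate before the next component
    return comps
```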
In Stage 3, we select ℓ* as the minimum point of the objective function (5) or (6), let QF,ℓ* = QE,ℓ*QI,ℓ*V̂ℓ*, and then compute the K principal components A(sℓ*) = XQF,ℓ*. Moreover, K is usually much smaller than min(n, m). Then, we build a regression model with yi as responses and Ai (the i-th row of A(sℓ*)) as covariates, denoted by R(yi, Ai; θ), where θ is a vector of unknown (finite-dimensional or nonparametric) parameters. Specifically, based on {(yi, Ai)}i≥1, we estimate θ by

θ̂ = argminθ Σi=1n ρ(yi, Ai, θ) + P3(θ),
where ρ(·, ·, ·) is a loss function, which depends on both the regression model and the data, and P3(·) is a penalty function, such as the Lasso. This leads to a prediction model R(yi, Ai; θ̂). For instance, for a binary response yi = 1 or 0, we may consider a sparse logistic model for R(yi, Ai; θ), for example logit{P(yi = 1 | Ai)} = θ0 + Aiθ1 with a Lasso penalty on θ1.
Given a test feature vector x*, we can make predictions from our prediction model as follows:
Center each component of x* by calculating x̃* = x* − μ̂x, in which μ̂x is the mean vector learnt from the training data;
Optimize an objective function based on R(y, x̃*TQF,ℓ*; θ̂) to calculate an estimate of y, denoted by ŷ*.
Our prediction model is applicable to various regression settings for continuous and discrete responses and multivariate and univariate responses, such as survival data and classification problems.
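For the binary case, a hedged sketch of this Stage 3 pipeline is given below: a sparse logistic classifier is fit to the SWPCR scores, and a new feature vector x* is centered, projected, and classified. scikit-learn's L1-penalized LogisticRegression is only a stand-in for the penalized estimator θ̂ described above, not the paper's exact implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_stage3_classifier(A, y, C=1.0):
    """Sparse (L1-penalized) logistic regression of class labels y on the SWPCR scores A (n x K)."""
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(A, y)
    return clf

def predict_new(clf, x_new, mu_x, Q_F):
    """Center a test vector, project it onto the selected components, and predict its label."""
    a_new = (x_new - mu_x) @ Q_F        # x~* Q_F,l*
    return clf.predict(a_new.reshape(1, -1))[0]
```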
2.3 Importance Score Weights and Spatial Weights
There are two sets of weights in SWPCR: (i) importance score weights, which enable a selective treatment of individual features, and (ii) spatial weights, which accommodate the underlying spatial dependence among features across neighboring vertexes on the graph. Below, we propose strategies for determining both sets of weights.
Importance Score Weights
As discussed in Section 2.2, at each vertex g, wI,g, such as the Pearson correlation in (1), is calculated from a statistical model relating xg and Y in order to perform feature selection according to each feature’s discriminative importance. Statistically, most existing methods use a marginal (or vertex-wise) model by assuming

p(xi,g | yi, β) = p(xi,g | yi, β(g)) for each g ∈ 𝒱,
where β = (β(g): g ∈ 𝒱) and β(g) is introduced to quantify the association between yi and xi,g at each vertex g ∈ 𝒱. At the g-th vertex, wI,g is a statistic based on the marginal model p(xi,g | yi, β(g)). However, these wI,gs largely ignore the complex spatial structure, such as the homogeneous patches defined below, across all vertexes on the graph.
For a graph ζ = (𝒱, ℰ), it is common to assume that the β(g) across all vertexes are naturally clustered into P homogeneous patches, denoted by {𝒫l : l = 1, …, P}, such that P ≪ m, ∪l=1P 𝒫l = 𝒱, and β(g) varies smoothly within each 𝒫l. Note that a patch consists of a set of vertexes that are connected through edges in ℰ. That is, if g, g′ ∈ 𝒫l, then there is a sequence of vertexes g0 = g, ···, gM = g′ in 𝒫l such that (gj−1, gj) ∈ ℰ for all j = 1, …, M. It has been shown that for graph data, algorithms based on patch information have led to state-of-the-art techniques for classification and denoising; see, for example, [18] for an overview of image patches.
We propose a strategy to jointly model xi and yi and simultaneously calculate wI,g across all vertexes, while learning the homogeneous patches 𝒫l. The strategy is to model the conditional distribution of xi given yi, denoted by p(xi | yi, β). Then we can learn the patches in 𝒱 from the estimated β.
Here we consider a set of vertexes 𝒱 with unknown edge information ℰ. It is important to learn the homogeneous patches and then form the edge set ℰ. Let 𝒩(g, h) be the neighborhood of vertex g at scale h. We consider a sequence of nested neighborhoods across multiple scales hs such that h0 = 0 ≤ h1 ≤ ··· ≤ hS and 𝒩(g, h0) = {g} ⊂ ··· ⊂ 𝒩(g, hS). To learn the homogeneous patches, the general framework of the Multiscale Adaptive Regression Model (MARM) developed in [13] maximizes a sequence of weighted functions given by
ℓn(β(g); hs) = Σi=1n Σg′∈𝒩(g, hs) ω(g, g′; hs) log p(xi,g′ | yi, β(g)), (7)
where ω(g, g′; h) characterizes the similarity between the data in vertexes g′ and g, with ω(g, g; h) = 1. If ω(g, g′; h) ≈ 0, then the observations in vertex g′ do not provide information on β(g). Therefore, ω(g, g′; h) can prevent the incorporation of vertexes whose data do not contain information on β(g) and preserve the edges of homogeneous regions. Let D1(g, g′) and D2(β̂(g; hs−1), β̂(g′; hs−1)) be, respectively, the spatial distance between vertexes g and g′ and a similarity measure between β̂(g; hs−1) and β̂(g′; hs−1). Then ω(g, g′; hs) can be defined as
ω(g, g′; hs) = Kloc(D1(g, g′)/hs) Kst(D2(β̂(g; hs−1), β̂(g′; hs−1))/γn), (8)
where Kloc(·) and Kst(·) are two nonnegative kernel functions and γn is a bandwidth parameter that may depend on n. See [13] for the detailed MARM algorithm. After the final iteration at scale hS, we obtain β̂(g; hS) and its covariance matrix, denoted by Cov(β̂(g; hS)), across all g ∈ 𝒱, together with ω(g, g′; hS) for all g′ ∈ 𝒩(g, hS) and g ∈ 𝒱. Finally, we calculate the statistics wI,g based on β̂(g; hS) and Cov(β̂(g; hS)), such as Wald statistics, and then use a clustering algorithm, such as the K-means algorithm, to group {β̂(g; hS): g ∈ 𝒱} into several homogeneous clusters, in which β̂(g; hS) varies smoothly within each cluster. Moreover, each homogeneous cluster can be a union of several homogeneous patches.
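The last step of this construction, turning the local estimates into importance scores and grouping them into roughly homogeneous clusters, can be sketched as follows. Here the per-vertex coefficient is assumed to be scalar, the squared Wald statistic is used as wI,g, and scikit-learn's KMeans stands in for the clustering algorithm; the number of clusters is a user choice.

```python
import numpy as np
from sklearn.cluster import KMeans

def wald_scores_and_clusters(beta_hat, beta_se, n_clusters=5, seed=0):
    """Vertex-wise squared Wald statistics as importance scores, then K-means on the local
    estimates beta_hat (m,) to group vertexes into roughly homogeneous clusters."""
    w_I = (beta_hat / (beta_se + 1e-12)) ** 2
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(
        beta_hat.reshape(-1, 1))
    return w_I, labels
```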
Spatial Weights
As discussed in Section 2.2, wE,gg′ often characterizes the degree of ‘similarity’ between vertexes g and g′. The local spatial weighting matrix consists of nonnegative weights assigned to the spatially neighboring vertexes of each vertex. It is assumed that
wE,gg′ = ω(g, g′; hS) 1(g′ ∈ 𝒩(g, hS)), (9)
in which ω(g, g′; hS) is defined in (8). Therefore, wE,gg′ = 0 for all g′ ∉ 𝒩(g, hS) and wE,gg = 1. The kernel Kloc(D1(g, g′)/hS) gives less weight to vertexes g′ ∈ 𝒩(g, hS) whose locations are far from vertex g. The kernel Kst(·) down-weights vertexes g′ with large D2(β̂(g; hS), β̂(g′; hS)), which indicates a large difference between β̂(g′; hS) and β̂(g; hS). Moreover, following [4, 13, 15, 16], we set Kloc(x) = (1 − x)+ and Kst(x) = exp(−x). Although m is often much larger than n, the computational burden associated with the local spatial weights is very minor when hS is relatively small.
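A minimal sketch of (8)–(9) with the kernels Kloc(x) = (1 − x)+ and Kst(x) = exp(−x) is given below: weights are computed between pairs of vertexes within the radius-hS neighborhood and collected into WE. Using Euclidean distances between vertex coordinates for D1 and between local estimates β̂ for D2 is an assumption; for large m one would restrict the computation to local neighborhoods rather than all pairs.

```python
import numpy as np

def K_loc(x):   # truncated linear kernel, (1 - x)_+
    return np.maximum(1.0 - x, 0.0)

def K_st(x):    # exponential kernel, exp(-x)
    return np.exp(-x)

def spatial_weight_matrix(coords, beta_hat, h, gamma_n):
    """W_E per (8)-(9): weights vanish outside the radius-h neighborhood and shrink when
    the local estimates differ; coords is (m x d), beta_hat is (m x p)."""
    D1 = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1))      # spatial distance
    D2 = np.sqrt(((beta_hat[:, None, :] - beta_hat[None, :, :]) ** 2).sum(-1))  # estimate distance
    W = K_loc(D1 / h) * K_st(D2 / gamma_n)
    W[D1 > h] = 0.0                      # w_E,gg' = 0 outside the neighborhood
    np.fill_diagonal(W, 1.0)             # w_E,gg = 1
    return W
```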
3 Simulation Study
In this section, we conducted a simulation study with binary responses in order to examine the finite-sample performance of SWPCR in high-dimensional classification analysis. We demonstrate that SWPCR outperforms many state-of-the-art methods, at least in the simulated datasets.
We simulated 20 × 20 × 10 (x × y × z) 3D images from a linear model given by
xi(g) = B0(g) + B1(g) yi + εi(g), (10)
where yi is the class label coded as either 0 or 1 and the εi(g) are random variables with zero mean. The true mean images of class yi = 0 and class yi = 1 are shown in Figure 2. Voxels in the red cuboid region have the maximum difference of 1 between classes 0 and 1. The red cuboid has dimension 3 × 3 × 4 and contains 36 voxels. In this case, m = 4,000, and we set n = 100 with 60 images from Class 0 and the rest from Class 1. We consider three types of noise εi(g) in (10). First, the εi(g) were independently generated from a N(0, 2²) distribution across all voxels g. Second, the εi(g) were generated by locally averaging independent Gaussian noise over the set {g′ : ||g′ − g|| ≤ 1} in order to introduce short-range spatial correlation, where mg is the number of voxels in this set. Third, to introduce long-range spatial correlation, the εi(g) were generated from smooth sinusoidal spatial basis functions weighted by random coefficients ξi,k, where the ξi,k for k = 1, 2, 3 were independently generated from a N(0, 1) distribution. Moreover, the noise variances in all voxels of the red cuboid region equal 4, 4/6, and 4{sin(πg1/10)² + cos(πg2/10)² + sin(πg3/5)²} + 4 for Type I, II, and III noises, respectively. Therefore, among the three types of noise, Type III noise has the smallest signal-to-noise ratio and Type II noise has the largest one.
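A sketch of the Type I simulation setup under the stated dimensions is given below; the placement of the 3 × 3 × 4 signal cuboid and the zero baseline image B0 are our own assumptions, since only the cuboid size and the between-class difference of 1 are specified above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, shape = 100, (20, 20, 10)                 # 60 class-0 and 40 class-1 images, m = 4,000 voxels
y = np.concatenate([np.zeros(60), np.ones(40)])

B0 = np.zeros(shape)                         # baseline mean image (assumed zero)
B1 = np.zeros(shape)
B1[8:11, 8:11, 3:7] = 1.0                    # 3 x 3 x 4 cuboid with between-class difference 1
                                             # (cuboid location is illustrative only)

# Type I noise: independent N(0, 2^2) at every voxel
X = np.stack([B0 + yi * B1 + rng.normal(0.0, 2.0, size=shape) for yi in y])
X = X.reshape(n, -1)                         # n x m data matrix
```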
We ran the three stages of SWPCR as follows. In Stage 1, we let {hℓ = 1.2ℓ, ℓ = 0, 1, …, S = 5} and, for each g ∈ 𝒱, set wI,g = −m log(p(g))/Σg′∈𝒱{−log(p(g′))}, where p(g) is the p-value of the Wald test of B1(g) = 0 based on (7), with β(g) = (B0(g), B1(g))T, for each voxel g. The spatial weight WE is given by (9). We did not use the simple Pearson correlation in (1) for computing the weights because it neglects the spatial correlation of the data. In Stage 2, for each hℓ, we define QE,ℓ = WE and generate QI,ℓ through (2) and (3), where sI,ℓ thresholds the wI,g so that only voxels with p(g) < 0.01 are retained. We then extract K principal components via GPCA to obtain low-dimensional representations of the simulated images and perform classification analysis. The results are very stable across different numbers of principal components, and here we set K = 5. In Stage 3, we tried different classification methods, including linear regression, k-nearest neighbors (k-NN) [11], and support vector machines (SVM) [14], on these low-dimensional representations. Based on the leave-one-out cross-validation misclassification error, linear regression is slightly better than the others. The linear regression uses the class label yi as the dependent variable and the principal components as explanatory variables. If the predicted value is less than 0.5, the image is classified as 0; otherwise, it is classified as 1.
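For reference, the leave-one-out evaluation of this Stage 3 linear-regression classifier can be sketched as follows; the 0.5 cut-off on the predicted value for the 0/1 labels and the explicit intercept column are our own choices.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

def loo_misclassification(A, y):
    """Leave-one-out error of a linear-regression classifier on the scores A (n x K)."""
    errors = 0
    for tr, te in LeaveOneOut().split(A):
        Atr = np.column_stack([np.ones(len(tr)), A[tr]])        # add intercept
        coef, *_ = np.linalg.lstsq(Atr, y[tr], rcond=None)
        yhat = np.concatenate([[1.0], A[te][0]]) @ coef
        errors += int((yhat >= 0.5) != bool(y[te][0]))          # classify at the 0.5 cut-off
    return errors / len(y)
```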
We compared SWPCR with other state-of-the-art classification methods. Leave-one-out cross validation is used to calculate the misclassification rates of the different methods. The other classification methods considered here include sparse linear discriminant analysis (sLDA) [6], sparse partial least squares (SPLS) analysis [5], sparse logistic regression (SLR) [20], SVM, and the regularized optimal affine discriminant (ROAD) [8]. These methods are well known for their excellent performance in various simulated and real data sets. Inspecting Table 1 reveals that, except for SWPCR, all classification methods perform rather poorly when the signal-to-noise ratio is low in the simulated datasets with Type I and II noises. Except for SPLS, PCA, and SWPCR, all other methods seem to be sensitive to the presence of the long-range correlation structure in Type III noise.
Table 1. Misclassification rates (leave-one-out cross validation) of the competing methods for the three types of noise.

| Noise | sLDA | SPLS | SLR | SVM | ROAD | PCA | SWPCR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Type I | 0.28 | 0.43 | 0.45 | 0.38 | 0.36 | 0.36 | 0.10 |
| Type II | 0.27 | 0.08 | 0.18 | 0.26 | 0.08 | 0.45 | 0.03 |
| Type III | 0.52 | 0.30 | 0.61 | 0.60 | 0.50 | 0.35 | 0.09 |
4 Real Data Analysis
4.1 ADNI PET Data
The real data set is the baseline fluorodeoxyglucose positron emission tomography (FDG-PET) data downloaded from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) web site (www.loni.ucla.edu/ADNI). The ADNI1 PET data set consists of 196 subjects (102 normal controls (NC) and 94 AD subjects). Three subjects are missing gender and age information. Among the remaining subjects, there are 117 males whose mean age is 76.20 years with standard deviation 6.06 years and 76 females whose mean age is 75.29 years with standard deviation 6.29 years.
The dimension of the processed PET images is 79 × 95 × 69. The left panel of Figure 3 shows selected slices of the processed PET images from 2 randomly selected AD subjects and 2 randomly selected NC subjects.
4.2 Binary Classification
Our first goal is to apply SWPCR to classify subjects from ADNI1 into the AD or NC group based on their FDG-PET images. This goal is associated with the second primary objective of ADNI, which aims to develop new diagnostic methods for AD intervention, prevention, and treatment. As in Section 3, SWPCR proceeds through the three stages described above, which are not repeated here. The right panel of Figure 3 shows three orthogonal slices of the weight matrix QI,ℓ at coordinate (40, 57, 26) from Stage 2 of SWPCR. The red regions in the three slices correspond to large importance score weights and contain most of the classification information.
We compared SWPCR with six other classification methods: sLDA, SPLS, SLR, SVM, ROAD, and PCA. We used their leave-one-out cross validation misclassification rates. Table 2 shows the classification results of all seven methods. sLDA performs much worse than the other six methods. ROAD performs slightly better than PCA. SPLS and SVM are comparable with each other, and both outperform SLR and ROAD. SWPCR outperforms all six competing methods. This suggests that classification performance can be significantly improved by incorporating spatial smoothness into simple dimension reduction methods, such as PCA.
Table 2. Leave-one-out cross validation misclassification rates for classifying AD versus NC subjects.

| sLDA | SPLS | SLR | SVM | ROAD | PCA | SWPCR |
| --- | --- | --- | --- | --- | --- | --- |
| 0.255 | 0.163 | 0.179 | 0.168 | 0.189 | 0.194 | 0.117 |
4.3 Age Prediction
Our second goal is to apply SWPCR to predict subjects’ age based on their FDG-PET images. The response variable y is the age of the subject, and the explanatory variables are the latent scores extracted from the image data. It would also be interesting to use memory test scores as the response variable y; however, the data set here contains no such information. The three subjects without age information are removed, leaving 193 images. In model (10), yi becomes the age of subject i. The detailed stages of SWPCR are similar to those in Section 3 and are not repeated here; the only difference is in Stage 3, where we run regression rather than classification methods to relate age to the SWPCR latent scores.
First, we compared SWPCR with three other dimension reduction methods: PCA, weighted PCA (WPCA) [17], and supervised PCA (SPCA) [2]. We used leave-one-out cross validation to compute the prediction errors of all four methods. Let ŷi be the fitted response value based on the regression model; the prediction error is defined as |ŷi − yi|/|yi|. Subsequently, we calculated the error difference between SWPCR and each of the three other methods across different numbers (K = 5, 7, 10) of principal components. Panels (a)–(c) in Figure 4 show the boxplots of the error differences between SWPCR and PCA, WPCA, and SPCA, respectively. The error differences are almost always less than 0 (below the dashed line), which demonstrates the better performance of SWPCR in dimension reduction.
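The relative prediction error and its leave-one-out computation can be sketched as follows, assuming an ordinary linear regression of age on the extracted scores; scikit-learn's LeaveOneOut and LinearRegression are convenient stand-ins for the actual fitting routine.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

def loo_prediction_errors(A, y):
    """Leave-one-out relative prediction errors |yhat_i - y_i| / |y_i| for a linear model on scores A."""
    errs = []
    for tr, te in LeaveOneOut().split(A):
        model = LinearRegression().fit(A[tr], y[tr])
        yhat = model.predict(A[te])[0]
        errs.append(abs(yhat - y[te][0]) / abs(y[te][0]))
    return np.array(errs)
```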
Second, we compared SWPCR with several other high-dimensional regression methods, including penalized regression (PR) [19], sure independence screening (SIS) regression [7], support vector regression (SVR) [3], and SPLS [5]. Panel (d) in Figure 4 shows the boxplots of the prediction error differences between SWPCR and each of these regression methods. The results further confirm the better performance of SWPCR in regression.
5 Discussion
SWPCR enables a selective treatment of individual features, accommodates the complex dependence among features of graph data, and utilizes the underlying spatial patterns of imaging data. SWPCR integrates feature selection, smoothing, and feature extraction into a single framework. In the simulation studies and the real data analysis, SWPCR shows substantial improvement over many state-of-the-art methods for high-dimensional problems.
Acknowledgments
This work was partially supported by the Startup Fund of the University of South Florida, NIH grants MH086633, RR025747, and MH092335, and NSF grants SES-1357666 and DMS-1407655.
Contributor Information
Dan Shen, Email: danshen@usf.edu.
Hongtu Zhu, Email: htzhu@email.unc.edu.
References
1. Aharon M, Elad M, Bruckstein A. K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing. 2006;54:4311–4322.
2. Bair E, Hastie T, Paul D, Tibshirani R. Prediction by supervised principal components. Journal of the American Statistical Association. 2006;101(473):119–137.
3. Basak D, Pal S, Patranabis DC. Support vector regression. Neural Information Processing-Letters and Reviews. 2007;11(10):203–224.
4. Buades A, Coll B, Morel JM. A non-local algorithm for image denoising. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005). IEEE; 2005. pp. 60–65.
5. Chun H, Keles S. Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society, Series B. 2010;72:3–25. doi:10.1111/j.1467-9868.2009.00723.x.
6. Clemmensen L, Hastie T, Witten D, Ersbøll B. Sparse discriminant analysis. Technometrics. 2011;53(4):406–413.
7. Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society, Series B. 2008;70(5):849–911. doi:10.1111/j.1467-9868.2008.00674.x.
8. Fan J, Feng Y, Tong X. A road to classification in high dimensional space: the regularized optimal affine discriminant. Journal of the Royal Statistical Society, Series B. 2012;74(4):745–771. doi:10.1111/j.1467-9868.2012.01029.x.
9. Friston KJ. Modalities, modes, and models in functional neuroimaging. Science. 2009;326:399–403. doi:10.1126/science.1174521.
10. Grenander U, Miller MI. Pattern Theory: From Representation to Inference. Oxford University Press; 2007.
11. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer; 2009.
12. Lee M, Shen H, Huang JZ, Marron JS. Biclustering via sparse singular value decomposition. Biometrics. 2010;66:1087–1095. doi:10.1111/j.1541-0420.2010.01392.x.
13. Li Y, Zhu H, Shen D, Lin W, Gilmore JH, Ibrahim JG. Multiscale adaptive regression models for neuroimaging data. Journal of the Royal Statistical Society, Series B. 2011;73:559–578. doi:10.1111/j.1467-9868.2010.00767.x.
14. Lin Y. Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery. 2002;6:259–275.
15. Manjón JV, Carbonell-Caballero J, Lull JJ, García-Martí G, Martí-Bonmatí L, Robles M. MRI denoising using non-local means. Medical Image Analysis. 2008;12(4):514–523. doi:10.1016/j.media.2008.02.004.
16. Polzehl J, Spokoiny VG. Propagation-separation approach for local likelihood estimation. Probability Theory and Related Fields. 2006;135:335–362.
17. Skočaj D, Leonardis A, Bischof H. Weighted and robust learning of subspace representations. Pattern Recognition. 2007;40(5):1556–1569.
18. Taylor KM, Meyer FG. A random walk on image patches. SIAM Journal on Imaging Sciences. 2012;5:688–725.
19. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological). 1996;58:267–288.
20. Yamashita O. Quick manual for sparse logistic regression toolbox ver. 1.2.1: software. 2011. http://www.cns.atr.jp/~oyamashi/SLR_WEB/
21. Yan S, Xu D, Zhang B, Zhang HJ, Yang Q, Lin S. Graph embedding and extensions: a general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2007;29:40–51. doi:10.1109/TPAMI.2007.12.