Kernel Continuum Regression

Myung Hee Lee; Yufeng Liu

doi:10.1016/j.csda.2013.06.016

Author manuscript; available in PMC: 2014 Dec 1.

Published in final edited form as: Comput Stat Data Anal. 2013 Dec;68:190–201. doi: 10.1016/j.csda.2013.06.016

PMCID: PMC3777709 NIHMSID: NIHMS502183 PMID: 24058224

Abstract

The continuum regression technique provides an appealing regression framework connecting ordinary least squares, partial least squares and principal component regression in one family. It offers some insight on the underlying regression model for a given application. Moreover, it helps to provide deep understanding of various regression techniques. Despite the useful framework, however, the current development on continuum regression is only for linear regression. In many applications, nonlinear regression is necessary. The extension of continuum regression from linear models to nonlinear models using kernel learning is considered. The proposed kernel continuum regression technique is quite general and can handle very flexible regression model estimation. An efficient algorithm is developed for fast implementation. Numerical examples have demonstrated the usefulness of the proposed technique.

Keywords: Continuum Regression, Kernel regression, Ordinary Least Squares, Principal Component Regression, Partial Least Squares

1. Introduction

Regression is one of the most fundamental and useful statistical techniques. It helps to relate explanatory variables with a response variable and build predictive models. Ordinary Least Squares (OLS) regression estimates the conditional mean of the response variable given covariates and is commonly used in practice. Despite its simple implementation and good interpretability, OLS may suffer from numerical instability when there is multicollinearity among the covariates or when the dimension of the covariates is relatively high. In that case, Ridge Regression (RR), which can be viewed as a penalized approach, may serve as an alternative.

Another popular group of regression techniques performs regression analysis based on a small number of linear transformations of the explanatory variables. For example, Principal Component Regression (PCR) first summarizes multiple explanatory variables, which can be high dimensional, into a few principal component directions and then performs regression on those principal component directions. These principal component directions are orthogonal to each other, yet contain most of the variation in the explanatory variables. Thus, PCR can circumvent the potential numerical difficulty of OLS. Partial Least Squares (PLS) is a related regression technique that has been widely used in the field of chemometrics. Similar to PCR, PLS also uses a small number of linear transformations of the covariates for regression. The main difference between PLS and PCR is that PCR finds those transformations without using the response variable, while PLS uses both the covariates and the response variable to seek suitable transformations.

With various regression techniques available, it would be desirable to study the differences and connections among these methods. Stone and Brooks (1990) pointed out that the seemingly different regression procedures such as OLS, PLS, and PCR differ only in one aspect: the target quantity maximized at the first step when finding linear transformations of the explanatory variables. Based on this analogy, they formulated a richer family of regression methods, Continuum Regression (CR), by introducing a continuum parameter which connects these three methods.

Similar to other methods, CR also aims to find directional vectors to transform the explanatory variables into new latent predictors which are orthogonal to each other and are constructed as linear combinations of the original predictors. There are two aspects of the latent predictors: one is the variation of the original predictors explained by each latent predictor and the other is the correlation between each latent predictor and the response. The quantity that CR maximizes involves both the variance of the latent variable and its correlation with the response, with a parameter controlling the relative proportion of the two terms. With its flexible construction, CR is quite general: it contains OLS and PCR as the two extremes and PLS in the middle. In particular, OLS ignores the variance of the latent variable and maximizes the correlation between the observed response vector and the predicted response vector. In contrast, PCR finds the regression directional vector so that the variance of the latent predictor is maximized. Interestingly, PLS essentially maximizes the covariance between the observed and the predicted response vectors. Besides these three special cases, CR also covers many other methods in the whole spectrum. Frank and Friedman (1993) provide a nice overview of CR and other related regression techniques. Sundberg (1993) and Björkström and Sundberg (1999) reveal a close connection between RR and the CR family. Chen and Cook (2010) investigated some asymptotic properties of CR.

The CR approach is potentially useful when the relationship between the response and the explanatory variable is linear. However, when the true relationship is nonlinear, the predictive performance of the CR family can be improved if the model is built as a nonlinear function of the explanatory variables.

In the literature, there has been some work in this direction which generalizes some special cases of CR via kernel learning. The nonlinear generalization of the building blocks for PCR, i.e., nonlinear Principal Component Analysis (PCA), has been studied in the field of pattern recognition, where lower dimensional feature extraction of high dimensional data becomes an important task. See Schölkopf et al. (1998), Mika et al. (1999), and Shawe-Taylor and Cristianini (2004) for more details. Rosipal et al. (2000b) and Rosipal et al. (2000a) deal with the nonlinear generalization of PCR, which uses nonlinear PCA components as the latent variables in the regression analysis. As in the linear case, the feature extraction based on PCA is not tailored to the regression problem at hand, and consequently the predictive performance of PCR is usually not as good as that of PLS. Walczak and Massart (1996) and Rosipal and Trejo (2001) generalize PLS to incorporate nonlinear cases.

In this paper, we extend the linear CR model to a nonlinear CR model using the kernel trick from machine learning. The proposed kernel CR (kCR) incorporates special cases such as kernel OLS, kernel PLS and kernel PCR in one unified framework. If a linear kernel map is chosen, kCR is the same as the ordinary CR. Section 2 provides the mathematical formulation of kCR. The first part presents a systematic way to construct latent variables from an optimization point of view, and the second part runs a regression analysis with the selected latent variables. Section 3 gives the details of the algorithm for solving the optimization problem in the first step. Numerical performance of the proposed method is investigated in Section 4 through simulation examples and real data analysis. We conclude the paper with a brief discussion in Section 5.

2. Continuum Regression and Its Kernel Extension

In this section, we first briefly review the linear CR in Section 2.1 and then introduce its kernel extension in Section 2.2.

2.1. Review of Linear Continuum Regression

Suppose that we have n pairs of data points for regression, $\{(x_{i}, y_{i})\}_{i = 1}^{n}$, where x_i ∈ ℝ^d are the explanatory variables and y_i ∈ ℝ is the response variable. Define the n × d input data matrix as X = [x₁,…, x_n]^T, where each row vector x_i represents a d-dimensional input vector for i = 1, …, n. The output data vector is denoted by y = (y₁, …, y_n)^T. We assume that the data are mean-centered so that each column sum of the matrix X is 0. Denote X and Y as the random predictor vector and response variable respectively. Furthermore, define the d × d scatter matrix of the data X as S = X^TX and the cross covariance vector between X and y as s = X^Ty.

We now describe the linear CR technique in terms of its optimization criterion. CR is essentially a two-step regression procedure where in the first step, one finds a set of direction vectors in the input variable space, ℝ^d, and makes projections of data onto the subspace generated by these vectors. In the second step, we use these extracted features as regressors to build a regression model to predict the value of the response variable Y. In particular, for a given parameter α ∈ [0, 1], suppose that the first k direction vectors c₁, …, c_k have been constructed and we want to find c_{k+1} by maximizing

$$T(c) = \mathrm{Cov}(c^{T} X, Y)^{2}\, \mathrm{Var}(c^{T} X)^{\alpha/(1-\alpha) - 1} = (c^{T} X^{T} y)^{2} (c^{T} X^{T} X c)^{\alpha/(1-\alpha) - 1} = (c^{T} s)^{2} (c^{T} S c)^{\alpha/(1-\alpha) - 1}$$

(1)

subject to the constraints ‖c‖² = 1 and $c^{T} S c_{j} = 0$ for j = 1, …, k, i.e., the new latent predictor c^TX is uncorrelated with each of the previously constructed latent predictors c_j^TX.

Stone and Brooks (1990) showed that CR includes OLS and PCR at the two extremes, α = 0 and α → 1, respectively. Specifically, OLS can be viewed as maximizing correlation: the OLS direction maximizes, over all direction vectors c, the correlation between y and Xc, which is the multiple correlation coefficient. When α = 0, the optimization criterion in (1) reduces to this squared correlation (up to a factor that does not depend on c). At the other extreme of this family, α → 1, the variance term in (1) dominates the optimization criterion and the role of y is ignored. In that case, CR finds the PCR directions, which give linear combinations of the d covariates with maximum variation. In this family, PLS can be viewed as a compromise between the two extremes at α = 0.5. Once these direction vectors are obtained, one can construct the corresponding latent predictors and build regression models using these new predictors. In this case, the regression models are linear as the latent predictors are linear combinations of the original predictors.
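To make the two computable special cases concrete, the following is a minimal NumPy sketch, not the authors' implementation. It evaluates criterion (1) on synthetic data and compares the OLS direction (α = 0) and the first PLS direction (α = 0.5) against random unit directions. The function name `cr_criterion` and the simulated data are assumptions made only for illustration.

```python
import numpy as np

def cr_criterion(c, X, y, alpha):
    """Criterion (1): (c^T s)^2 (c^T S c)^(alpha/(1-alpha) - 1), for 0 <= alpha < 1."""
    S = X.T @ X                              # scatter matrix
    s = X.T @ y                              # cross covariance vector
    gamma = alpha / (1.0 - alpha)
    return (c @ s) ** 2 * (c @ S @ c) ** (gamma - 1.0)

# Synthetic, mean-centered data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X -= X.mean(axis=0)
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=50)
y -= y.mean()

S, s = X.T @ X, X.T @ y
c_ols = np.linalg.solve(S, s)
c_ols /= np.linalg.norm(c_ols)               # OLS direction, unit length
c_pls = s / np.linalg.norm(s)                # first PLS direction, unit length

# Random unit directions for comparison.
C = rng.normal(size=(200, 4))
C /= np.linalg.norm(C, axis=1, keepdims=True)
best_random_ols = max(cr_criterion(c, X, y, 0.0) for c in C)
best_random_pls = max(cr_criterion(c, X, y, 0.5) for c in C)
```

At α = 0 the criterion is (c^Ts)²/(c^TSc), which by the Cauchy–Schwarz inequality is maximized by c ∝ S⁻¹s; at α = 0.5 it is (c^Ts)², maximized over unit vectors by c ∝ s. The limit α → 1 is not evaluated here because the exponent diverges.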

PLS has been extensively used in the field of chemometrics since it was first introduced by Wold (1975). It was presented in an iterative algorithmic form as an alternative to OLS in the presence of high multicollinearity among input variables. Since its introduction, various algorithms have been proposed for PLS (Helland, 1990; Naes and Martens, 1985), but we will stick to the version given as an optimization solution in Stone and Brooks (1990). PLS is similar to PCR in the sense that it extracts potential regressors by creating a set of orthogonal transformations of the input variables. It differs in that it directly makes use of the output variable when creating the latent variables: the selected latent variable maximizes the covariance with the output variable among all linear combinations of the input variables. A least squares regression technique is applied once the latent variables are constructed.

As discussed earlier, the optimization problem (1) covers many regression techniques, including OLS, PLS, and PCR as special cases. In the next section, we discuss the extension of CR to the more general kernel framework.

2.2. Kernel Continuum Regression

Linear CR introduced by Stone and Brooks (1990) provides a nice regression framework which includes several well known regression methods as well as many other new techniques in this family. It helps us to better understand regression tools. Despite its usefulness, the original proposal was restricted to the linear setting. In this section, we explain how to extend the linear CR to a general kernel version.

Consider a mapping ϕ from X ∈ ℝ^d to a feature space, ℱ, and we propose to build a linear regression model using a set of latent variables in the feature space. Let 〈·, ·〉 be the inner product defined in ℱ. For presentation of the algorithm, we use some matrix notation for the feature mapped data. Denote the feature mapped data matrix as ϕ(X) = [ϕ(x₁), …, ϕ(x_n)]^T and the n × n kernel matrix as K, with (i, j)-th entry K_{ij} = 〈ϕ(x_i), ϕ(x_j)〉. The n × 1 vector whose entries are the projections of each data vector onto the direction vector c ∈ ℱ will be denoted by 〈c, ϕ(X)〉. Note that our use of the kernel function relates to the kernel trick concept used in machine learning. For example, the well known Support Vector Machine (Vapnik, 1998; Cristianini and Shawe-Taylor, 2000) utilizes the kernel trick to achieve nonlinear learning.

Let us define the transformed continuum parameter as γ = α/(1−α) > 0. Then the problem of finding direction vectors (1) in the feature space can be modified to the following:

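$$\max_{c}\; T_{\phi}(c) = \mathrm{Cov}(\langle c, \phi(X) \rangle, Y)^{2}\, \mathrm{Var}(\langle c, \phi(X) \rangle)^{\gamma - 1} = \left(\langle c, \phi(X) \rangle^{T} y\right)^{2} \left(\langle c, \phi(X) \rangle^{T} \langle c, \phi(X) \rangle\right)^{\gamma - 1}$$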

(2)

subject to the two constraints: 〈c, c〉 = 1 and 〈c, ϕ(X)〉^T 〈c_j , ϕ(X)〉 = 0, j = 1, …, k.

Next we show that the maximum value can always be achieved in the subspace generated by the mapped data vectors, ℛ_ϕ(X) = 𝒞{ϕ(x₁), …, ϕ(x_n)} ⊂ ℝ^∞. Let c be any unit vector in the feature space, and let P and Q be the orthogonal projections onto ℛ_ϕ(X) and onto its orthogonal complement, respectively, so that c = Pc + Qc. Comparing the objective function T_ϕ evaluated at c and at the normalized projection c̃ = Pc/‖Pc‖, we have

$$\begin{aligned} T_{\phi}(c) &= \left(\langle Pc + Qc, \phi(X) \rangle^{T} y\right)^{2} \left(\langle Pc + Qc, \phi(X) \rangle^{T} \langle Pc + Qc, \phi(X) \rangle\right)^{\gamma - 1} \\ &= \left(\langle Pc, \phi(X) \rangle^{T} y\right)^{2} \left(\langle Pc, \phi(X) \rangle^{T} \langle Pc, \phi(X) \rangle\right)^{\gamma - 1} \\ &\leq \left(\langle \tilde{c}, \phi(X) \rangle^{T} y\right)^{2} \left(\langle \tilde{c}, \phi(X) \rangle^{T} \langle \tilde{c}, \phi(X) \rangle\right)^{\gamma - 1} = T_{\phi}(\tilde{c}). \end{aligned}$$

Thus, projecting any direction vector onto the subspace ℛ_ϕ(X) and renormalizing it never decreases T_ϕ. Consequently, we can restrict the solution c to lie in ℛ_ϕ(X), i.e., we search for a maximizer of the form $c = \sum_{i = 1}^{n} a_{i} \phi(x_{i}) = \phi(X)^{T} a$. Using this representation, a simple algebraic calculation shows that the projection of the i-th data vector onto the j-th direction vector c_j can be expressed as a matrix multiplication, 〈c_j, ϕ(x_i)〉 = (i-th row of K) · a_j. This means that the transformed data can be computed without evaluating the mapping ϕ(x_i) explicitly, as long as the kernel matrix K and the weight vectors a_j are available.
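The inequality in the projection argument rests on two facts that can be spelled out in a short worked derivation. This is a sketch of the reasoning under the assumptions that Pc ≠ 0 and γ ≥ 0; it is not taken verbatim from the paper.

```latex
% (i) Qc is orthogonal to every mapped data vector, so the projections are unchanged:
\langle c, \phi(x_i) \rangle = \langle Pc, \phi(x_i) \rangle + \langle Qc, \phi(x_i) \rangle
                             = \langle Pc, \phi(x_i) \rangle, \qquad i = 1, \dots, n .

% (ii) T_\phi is homogeneous of degree 2\gamma: for any scalar s > 0,
T_\phi(s c) = s^{2} \cdot s^{2(\gamma - 1)} \, T_\phi(c) = s^{2\gamma} \, T_\phi(c) .

% Combining (i) and (ii) with Pc = \|Pc\| \, \tilde{c} and \|Pc\| \le \|c\| = 1:
T_\phi(c) = T_\phi(Pc) = \|Pc\|^{2\gamma} \, T_\phi(\tilde{c}) \le T_\phi(\tilde{c}) .
```

Fact (i) also shows that the orthogonality constraints, which involve c only through 〈c, ϕ(X)〉, remain satisfied after projection, since rescaling by 1/‖Pc‖ does not affect a constraint of the form "= 0".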

Based on the above considerations, the objective function (2) can be written as the following:

$$T_{\phi}(c) = \left(\langle \phi(X)^{T} a, \phi(X) \rangle^{T} y\right)^{2} \left(\langle \phi(X)^{T} a, \phi(X) \rangle^{T} \langle \phi(X)^{T} a, \phi(X) \rangle\right)^{\gamma - 1} = (a^{T} K y)^{2} (a^{T} K^{2} a)^{\gamma - 1}.$$

Furthermore, the orthonormality constraints can be rewritten as

$$\langle c, c \rangle = a^{T} K a = 1$$

and

$$\langle c, \phi(X) \rangle^{T} \langle c_{j}, \phi(X) \rangle = a^{T} K^{2} a_{j} = 0,$$

where c_j = ϕ(X)^Ta_j for some a_j ∈ ℝⁿ, j = 1,…, k, are the k previously constructed continuum direction vectors. Consequently, for a given γ > 0, the optimization problem (2) in the feature space ℱ, can be formulated as an n-dimensional optimization problem:

$$\max_{a \in \mathbb{R}^{n}}\; (a^{T} K y)^{2} (a^{T} K^{2} a)^{\gamma - 1}$$

(3)

subject to a^TKa = 1 and a^TK²a_j = 0 for j = 1,…, k. Note that the problem depends on the data vectors only through the kernel matrix K. The detailed algorithm on how to solve this dual problem (3) is presented in Section 3.
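As a sanity check on this dual formulation, here is a small NumPy sketch that uses the linear kernel, where ϕ is the identity and c = X^Ta can be formed explicitly. It confirms numerically that the dual objective in (3) matches the feature-space objective. The helper names `dual_objective` and `normalize`, the value γ = 0.7, and the random data are illustrative assumptions.

```python
import numpy as np

def dual_objective(a, K, y, gamma):
    """Objective of problem (3): (a^T K y)^2 (a^T K^2 a)^(gamma - 1)."""
    Ka = K @ a                       # projections <c, phi(x_i)>, i = 1, ..., n
    return (Ka @ y) ** 2 * (Ka @ Ka) ** (gamma - 1.0)

def normalize(a, K):
    """Rescale a so that <c, c> = a^T K a = 1."""
    return a / np.sqrt(a @ K @ a)

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
K = X @ X.T                          # linear kernel: phi(x) = x
gamma = 0.7                          # arbitrary illustrative value

a = normalize(rng.normal(size=20), K)
c = X.T @ a                          # c = phi(X)^T a, explicit only because phi(x) = x

# Feature-space objective evaluated directly from c.
primal = (c @ X.T @ y) ** 2 * (c @ X.T @ X @ c) ** (gamma - 1.0)
dual = dual_objective(a, K, y, gamma)
```

With a nonlinear kernel, `c` is generally not available in closed form, which is why only `K` and `a` appear in `dual_objective`.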

Once a₁,…, a_k are solved, we can get the first k kernel continuum directional vectors c₁,…, c_k ∈ ℱ via c_j = ϕ(X)^Ta_j. Next, we extract information by projecting the mapped data onto these direction vectors and use the projections as regressors for the regression fit procedure in the second step. Specifically, we use the continuum latent variables, 〈ϕ(X), c₁〉,…, 〈ϕ(X), c_k〉, as regressor variables in the linear regression model Y = β₁〈ϕ(X), c₁〉 + ⋯ + β_k〈ϕ(X), c_k〉 + ε.

Utilizing transformed data, a simple expression for the OLS estimate of the coefficient vector β = (β₁,…, β_k) is obtained as β̂ = (X̃^TX̃)⁻¹X̃^Ty, where (i, j)-th entry of X̃_n×k is the projection of the i-th data vector onto the j-th continuum direction vector, 〈c_j, ϕ(x_i)〉, i.e., X̃_n×k = K[a₁,…, a_k].

The response can be predicted at any input data point x ∈ ℝ^d as Ŷ = β̂₁〈ϕ(x), c₁〉 + ⋯ + β̂_k〈ϕ(x), c_k〉. Suppose that we have a set of testing points $\{x_{i}^{*}\}_{i = 1, \dots, n_{t}}$ at which to predict y values. Then the vector of predicted values is given as follows:

$$\hat{y} = \phi(X^{*}) [c_{1}, \dots, c_{k}] \hat{\beta} = K_{t} [a_{1}, \dots, a_{k}] \hat{\beta},$$

where K_t is an n_t × n matrix whose (i, j)-th element is $\langle \phi(x_{i}^{*}), \phi(x_{j}) \rangle$ for i = 1,…, n_t and j = 1,…, n. Therefore, we can achieve nonlinear regression model building and prediction using the kernel function, without the need for an explicit ϕ(·). The kernel representation of the function can also be shown using the Representer Theorem of Kimeldorf and Wahba (1971) when a reproducing kernel is applied. In that case, the corresponding functional space is a reproducing kernel Hilbert space.
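The second step (regression on the latent variables, then prediction through K_t) can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' code: the weight matrix `A` below is a random placeholder, not a kCR solution, and the function names are hypothetical. With the linear kernel and as many generic latent variables as covariates, the fit must coincide with ordinary least squares, which gives a convenient check.

```python
import numpy as np

def fit_latent_regression(K, A, y):
    """OLS of y on the latent variables X_tilde = K [a_1, ..., a_k]."""
    X_tilde = K @ A
    # Least-squares solution; equals (X_tilde^T X_tilde)^{-1} X_tilde^T y
    # when X_tilde has full column rank.
    beta_hat, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)
    return beta_hat

def predict(K_t, A, beta_hat):
    """y_hat = K_t [a_1, ..., a_k] beta_hat for test points."""
    return K_t @ A @ beta_hat

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.normal(size=40)
X_new = rng.normal(size=(5, 3))

K = X @ X.T                          # training kernel matrix (linear kernel)
K_t = X_new @ X.T                    # n_t x n test-versus-training kernel matrix
A = rng.normal(size=(40, 3))         # placeholder weights, NOT kCR directions

beta_hat = fit_latent_regression(K, A, y)
y_new_hat = predict(K_t, A, beta_hat)

# Reference: plain OLS on the original covariates.
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Note that prediction needs only kernel evaluations between test and training points, never ϕ itself.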

3. Construction Algebra for Kernel CR

Fast and simple implementation is essential for most statistical techniques. In this section, we discuss how to implement kCR. Note that our proposed kCR includes the original linear CR as a special case when the linear kernel is applied.

If the kernel matrix K is not of full rank, the representation of the solution a of (3) is not unique. To avoid this ambiguity, we write the solution as a linear combination of the eigenvectors of the kernel matrix corresponding to nonzero eigenvalues. Let $K_{(n \times n)} = U_{(n \times m)} E_{(m \times m)} U_{(m \times n)}^{T}$ be the eigen-decomposition of the kernel matrix K, where m is the rank of K and E is the diagonal matrix of positive eigenvalues of K. Then, the solution of the optimization problem can be expressed as a_{k+1} = UE^{−1/2}z for some z ∈ ℝ^m. It can be directly checked that the maximization problem (3) can be formulated as the following:

$$\max_{z}\; (z^{T} d)^{2} (z^{T} E z)^{\gamma - 1},$$

(4)

subject to z^Tz = 1 and z^TEZ_k = 0, where d = E^{1/2}U^Ty and Z_k = [z₁,…, z_k].
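The change of variables a = UE^{−1/2}z can be checked numerically. The sketch below builds a rank-deficient linear kernel matrix, keeps the positive eigenvalues, and compares each term of problem (3) with its counterpart in problem (4). The relative tolerance used to decide which eigenvalues count as positive is an assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(15, 4))
y = rng.normal(size=15)
K = X @ X.T                                  # rank 4 < n = 15, so K is singular

evals, evecs = np.linalg.eigh(K)             # eigenvalues in ascending order
keep = evals > 1e-10 * evals.max()           # "positive" up to an assumed tolerance
e = evals[keep][::-1]                        # positive eigenvalues, descending
U = evecs[:, keep][:, ::-1]                  # matching eigenvectors (n x m)

d = np.sqrt(e) * (U.T @ y)                   # d = E^{1/2} U^T y

z = rng.normal(size=e.size)
z /= np.linalg.norm(z)                       # z^T z = 1
a = U @ (z / np.sqrt(e))                     # a = U E^{-1/2} z
```

Here `a @ K @ a` corresponds to z^Tz, `a @ K @ y` to z^Td, and `a @ K @ K @ a` to z^TEz, so the n-dimensional problem reduces to an m-dimensional one with a diagonal E.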

The problem above is equivalent to the maximization of its Lagrangian form:

$$T^{*}(z) := (z^{T} d)^{2} (z^{T} E z)^{\gamma - 1} - \lambda (z^{T} z - 1) - 2 z^{T} E Z_{k} \Lambda_{k},$$

subject to z^Tz = 1 and z^TEZ_k = 0, where Λ_k = [λ₁,…, λ_k] and λ, λ₁,…, λ_k are the Lagrange multipliers. The maximizer should be the solution to the following equation:

$$\frac{\partial T^{*}}{\partial z} = 2 (z^{T} d) (z^{T} E z)^{\gamma - 1} d + 2 (\gamma - 1) (z^{T} d)^{2} (z^{T} E z)^{\gamma - 2} E z - 2 \lambda z - 2 E Z_{k} \Lambda_{k} = 0.$$

(5)

Premultiplying (5) by z^T and dividing by 2, we have

$$(z^{T} d)^{2} (z^{T} E z)^{\gamma - 1} + (\gamma - 1) (z^{T} d)^{2} (z^{T} E z)^{\gamma - 1} - \lambda z^{T} z - z^{T} E Z_{k} \Lambda_{k} = 0.$$

Since z^Tz = 1 and z^TEZ_k = 0, we can express λ in terms of the data as λ = γ(z^Td)²(z^TEz)^{γ−1}. Plugging this back into equation (5) and dividing by 2, we obtain

$$(z^{T} d) (z^{T} E z)^{\gamma - 1} d + (\gamma - 1) (z^{T} d)^{2} (z^{T} E z)^{\gamma - 2} E z - \gamma (z^{T} d)^{2} (z^{T} E z)^{\gamma - 1} z - E Z_{k} \Lambda_{k} = 0.$$

(6)

For the sake of simplicity, let us write the scalars as

$$\tau_{z} = z^{T} d, \qquad \rho_{z} = z^{T} E z.$$

(7)

Writing the equation (6) along with the linear constraint, z^TEZ_k = 0, in a matrix form, we obtain the following:

$$\begin{pmatrix} \gamma \tau_{z}^{2} \rho_{z}^{\gamma - 1} I + (1 - \gamma) \tau_{z}^{2} \rho_{z}^{\gamma - 2} E & E Z_{k} \\ Z_{k}^{T} E & 0 \end{pmatrix} \begin{pmatrix} z \\ \Lambda_{k} \end{pmatrix} = \begin{pmatrix} \tau_{z} \rho_{z}^{\gamma - 1} d \\ 0 \end{pmatrix}.$$

(8)

To simplify the expression, we introduce partitioned matrix notations, and rewrite the equation above as

$$\begin{pmatrix} A & B \\ B^{T} & 0 \end{pmatrix} \begin{pmatrix} z \\ \Lambda_{k} \end{pmatrix} = \begin{pmatrix} q \\ 0 \end{pmatrix},$$

where $A = \gamma \tau_{z}^{2} \rho_{z}^{\gamma - 1} I + (1 - \gamma) \tau_{z}^{2} \rho_{z}^{\gamma - 2} E$, $B = E Z_{k}$, and $q = \tau_{z} \rho_{z}^{\gamma - 1} d$. Using the following fact on the inverse of the partitioned matrix,

$$\begin{pmatrix} A & B \\ B^{T} & 0 \end{pmatrix}^{-1} = \begin{pmatrix} A^{-1} - A^{-1} B (B^{T} A^{-1} B)^{-1} B^{T} A^{-1} & A^{-1} B (B^{T} A^{-1} B)^{-1} \\ (B^{T} A^{-1} B)^{-1} B^{T} A^{-1} & -(B^{T} A^{-1} B)^{-1} \end{pmatrix}$$

and imposing the unit vector condition, the unit vector solution z to (8) becomes

$$z = \frac{M q}{\| M q \|},$$

(9)

where M = A⁻¹ − A⁻¹B(B^TA⁻¹B)⁻¹B^TA⁻¹. Note that z does not depend on τ_z since this quantity cancels out in the normalization. It is important to point out that everything in (9) is known except for the scalar ρ_z, which is itself determined by z.
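The role of M can be verified numerically: applying M to q should reproduce the z-part of the solution of the full partitioned system, and the result should automatically satisfy the linear constraint B^Tz = 0. The sketch below uses random stand-ins for A, B and q (assumptions for illustration; in kCR they come from E, Z_k and d).

```python
import numpy as np

rng = np.random.default_rng(4)
m, k = 6, 2
G = rng.normal(size=(m, m))
A = G @ G.T + m * np.eye(m)          # symmetric positive definite stand-in for A
B = rng.normal(size=(m, k))          # stand-in for B = E Z_k
q = rng.normal(size=m)

A_inv = np.linalg.inv(A)
S = B.T @ A_inv @ B                  # k x k matrix B^T A^{-1} B
M = A_inv - A_inv @ B @ np.linalg.solve(S, B.T @ A_inv)

# Solve the full (m + k)-dimensional partitioned system directly for comparison.
full = np.block([[A, B], [B.T, np.zeros((k, k))]])
sol = np.linalg.solve(full, np.concatenate([q, np.zeros(k)]))
z_direct = sol[:m]

z = M @ q / np.linalg.norm(M @ q)    # unit vector solution, equation (9)
```

Because B^TM = 0 by construction, any vector of the form Mq satisfies the orthogonality constraint regardless of q, so only the unit-norm condition has to be imposed separately.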

Combining the updates of z and ρ through the relationships (7) and (9) gives a fixed point problem ρ = g(ρ), where g(ρ) = z(ρ)^TEz(ρ) and z(ρ) is the right-hand side of (9) evaluated at ρ. Since z^Tz = 1, equation (7) implies that the solution ρ lies in the interval [e_m, e₁], where e₁ (e_m) is the largest (smallest) eigenvalue in E. This observation reduces the search for the solution to a one-dimensional closed interval.

In a nutshell, the kCR algorithm for finding the maximizer z of problem (4) proceeds as follows.

Step 1. Fix τ at an arbitrary value, say τ = 1.
Step 2. Fix a value ρ ∈ [e_m, e₁].
Step 3. Compute $z(\rho) = \frac{M q}{\| M q \|}$, where $M = A^{-1} - A^{-1} B (B^{T} A^{-1} B)^{-1} B^{T} A^{-1}$, $A = \gamma \tau^{2} \rho^{\gamma - 1} I + (1 - \gamma) \tau^{2} \rho^{\gamma - 2} E$, $B = E Z_{k}$, and $q = \tau \rho^{\gamma - 1} d$.
Step 4. Evaluate g(ρ) = z(ρ)^TEz(ρ).
Step 5. Based on Steps 1-4, solve the nonlinear equation ρ = z(ρ)^TEz(ρ), and let the solution be ρ*.
Step 6. If the nonlinear equation in Step 5 has multiple roots, take the root ρ* whose z(ρ*) achieves the largest value of the objective function (4).
Step 7. Complete the search for the (k+1)-th continuum weight vector by computing z(ρ*) as in Step 3 and setting $a_{k+1} = U E^{-\frac{1}{2}} z(\rho^{*})$.

Note that the main challenging step is Step 5 of the algorithm where a numerical equation needs to be solved. However, since the equation is only a one-dimensional problem, typically it can be computed efficiently.
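The steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the root search in Step 5 is done by scanning a grid on [e_m, e₁] for sign changes of g(ρ) − ρ and refining each by bisection, which is one possible choice the paper does not prescribe. The function names, grid size and test data are assumptions, and the sketch has only been reasoned through for 0 ≤ γ ≤ 1, where A is positive definite.

```python
import numpy as np

def z_of_rho(rho, e, d, gamma, Zk=None):
    """Step 3 with tau = 1: z(rho) = M q / ||M q||. E = diag(e), so A is diagonal."""
    a_diag = gamma * rho ** (gamma - 1) + (1 - gamma) * rho ** (gamma - 2) * e
    q = rho ** (gamma - 1) * d
    Ainv_q = q / a_diag
    if Zk is None or Zk.shape[1] == 0:
        Mq = Ainv_q                                   # no earlier directions: M = A^{-1}
    else:
        B = e[:, None] * Zk                           # B = E Z_k
        Ainv_B = B / a_diag[:, None]
        S = B.T @ Ainv_B                              # B^T A^{-1} B
        Mq = Ainv_q - Ainv_B @ np.linalg.solve(S, B.T @ Ainv_q)
    return Mq / np.linalg.norm(Mq)

def objective(z, e, d, gamma):
    """Objective (4): (z^T d)^2 (z^T E z)^(gamma - 1)."""
    return (z @ d) ** 2 * (z @ (e * z)) ** (gamma - 1)

def kcr_direction(e, d, gamma, Zk=None, n_grid=400, n_bisect=100):
    """Steps 2-6: solve rho = g(rho) on [e_min, e_max]; return (z(rho*), rho*)."""
    def h(rho):                                       # h(rho) = g(rho) - rho
        z = z_of_rho(rho, e, d, gamma, Zk)
        return z @ (e * z) - rho

    grid = np.linspace(e.min(), e.max(), n_grid)
    vals = np.array([h(r) for r in grid])
    roots = []
    for lo, hi, h_lo, h_hi in zip(grid[:-1], grid[1:], vals[:-1], vals[1:]):
        if h_lo * h_hi > 0:
            continue                                  # no sign change in this cell
        for _ in range(n_bisect):                     # bisection refinement
            mid = 0.5 * (lo + hi)
            h_mid = h(mid)
            if h_lo * h_mid > 0:
                lo, h_lo = mid, h_mid
            else:
                hi = mid
        roots.append(0.5 * (lo + hi))
    if not roots:
        raise RuntimeError("no root of rho = g(rho) found on the grid")
    # Step 6: among all roots, keep the one with the largest objective value.
    best = max(roots, key=lambda r: objective(z_of_rho(r, e, d, gamma, Zk), e, d, gamma))
    return z_of_rho(best, e, d, gamma, Zk), best

# Illustrative data: linear kernel on random inputs.
rng = np.random.default_rng(5)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
K = X @ X.T
evals, evecs = np.linalg.eigh(K)
keep = evals > 1e-10 * evals.max()                    # assumed tolerance
e, U = evals[keep], evecs[:, keep]
d = np.sqrt(e) * (U.T @ y)

z_pls, _ = kcr_direction(e, d, gamma=1.0)             # alpha = 0.5 (PLS)
z1, rho1 = kcr_direction(e, d, gamma=0.5)             # first direction
z2, rho2 = kcr_direction(e, d, gamma=0.5, Zk=z1[:, None])   # second direction
a1 = U @ (z1 / np.sqrt(e))                            # Step 7: a = U E^{-1/2} z
```

For γ = 1 the matrix A reduces to the identity and z(ρ) = d/‖d‖ for every ρ, which recovers the first PLS direction and provides a simple check. A grid scan can miss roots that lie very close together, so a finer grid or a dedicated root finder may be preferable in practice.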

4. Numerical Study

In this section, we examine the performance of the proposed kCR using both simulated and real data examples. Our goal is to show the flexibility of kernel learning to achieve nonlinear regression function estimation as well as to examine the effect of the kCR parameter α.

4.1. Simulation

We generate an n × d dimensional X data matrix from a t-dimensional factor model,

X = T P' + E,

where T is an n × t random matrix, P is a fixed d × t orthonormal matrix, and E is a noise matrix. We set k = 3, d = 7, n = 100, β₁ = … = β_d = 1. Each row of T is generated from the t-dimensional Gaussian distribution N_t(0, Σ), where Σ = diag(3, 3, 3, 1,…, 1), and each row of E is generated from N_d(0, 0.5²I).
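A minimal sketch of this data-generating step is given below. Several details are assumptions, since the text does not pin them down: the number of factors t is taken from the length of the supplied diagonal of Σ (three factors with variance 3 in the usage line), and the fixed orthonormal matrix P is drawn once via a QR decomposition of a Gaussian matrix. The function name `simulate_X` is hypothetical.

```python
import numpy as np

def simulate_X(n, d, sigma_diag, noise_sd, rng):
    """Draw X = T P' + E from a t-factor model, with t = len(sigma_diag)."""
    t = len(sigma_diag)
    # Rows of T ~ N_t(0, diag(sigma_diag)).
    T = rng.normal(size=(n, t)) * np.sqrt(sigma_diag)
    # d x t matrix with orthonormal columns (assumed construction for P).
    P, _ = np.linalg.qr(rng.normal(size=(d, t)))
    # Rows of E ~ N_d(0, noise_sd^2 I).
    E = noise_sd * rng.normal(size=(n, d))
    return T @ P.T + E, P

rng = np.random.default_rng(6)
# Assumed configuration: t = 3 factors, d = 7, n = 100, noise sd 0.5.
X, P = simulate_X(n=100, d=7, sigma_diag=[3.0, 3.0, 3.0], noise_sd=0.5, rng=rng)
```

The response Y for each setting would then be built from the columns of X using the functions defined in the text, with a noise term whose distribution is not specified here.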

For the relationship between X and Y, we consider four different regression settings and investigate the finite sample performance of kCR as described below. Let

$$f_{1}(x) = \sin(x); \quad f_{2}(x) = \sin|x| / |x|; \quad f_{3}(x) = 1/2 \sqrt{x + 10} - 1; \quad f_{4}(x) = 5 + x - 5 x^{2} + 0.2 x^{3}; \quad f_{5}(x, y) = 5 + x y^{2}.$$

With these f functions, we consider four different settings as follows.

Setting 1: Y = X₁β₁ + ⋯ + X_dβ_d + ε;
Setting 2: Y = 5f₂(X₁)+5f₁(X₂−5)+f₄(X₃)+X₄β₄+⋯+ X_dβ_d+ε;
Setting 3: Y = 2f₂(X₁) + f₁(X₂/3) + f₂(X₂) + X₃β₃ + ⋯ + X_dβ_d + ε;
Setting 4: Y = 5f₂(X₁)+10f₃(X₂)+f₅(X₂,X₃)+X₄β₄+⋯+X_dβ_d+ε.

Setting 1 is a linear regression problem, whereas Settings 2-4 contain nonlinear functions. To illustrate this, Figure 1 displays scatter plots of (X_i, Y) for i = 1, 2, 3 under Setting 2. The true nonlinear relationships between the X_i's and Y are shown as dotted red curves; the relationship between each X_i and Y is clearly nonlinear.

Figure 1. Scatter plots of (X_i, Y) for Setting 2. Top: Y against X₁; Middle: Y against X₂; Bottom: Y against X₃.

When fitting the kCR, we consider three different choices of kernels: i) the linear kernel K(x, y) = x^Ty, ii) the polynomial kernel K(x, y) = (x^Ty + c)^m, and iii) the Gaussian kernel K(x, y) = e^{−‖x−y‖²/h}. For the polynomial kernel, we set m = 2, choose $c = -\min_{i, j} (x_{i}^{T} x_{j})$, and scale the kernel matrix so that the entries of K do not exceed 1. This step was needed to make the polynomial kernel stable. For the Gaussian kernel, the parameter h is chosen as the first quartile of all pairwise squared distances between the X data vectors. There are two tuning parameters in kCR: α ∈ [0, 1] and the number of components used in the regression, denoted by k. In theory, k can be as large as the rank of the kernel matrix K, but in practice we want to keep k relatively small to avoid overfitting. We ran kCR with α ranging from 0 to 1 in increments of 0.1 and k from 1 to 5. The tuning parameters are selected to minimize the validation error on an independent tuning set of size 100. Once the tuning parameters are selected, the test error is evaluated on an independent test set of size 1000.
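The three training kernel matrices described above can be sketched as follows. Some details are interpretations rather than facts from the paper: the polynomial kernel is scaled by dividing by its largest entry, and the quartile for h is taken over distinct pairs i < j. The function names are hypothetical.

```python
import numpy as np

def linear_kernel(X):
    """K(x, y) = x^T y."""
    return X @ X.T

def polynomial_kernel(X, m=2):
    """K(x, y) = (x^T y + c)^m with c = -min_{i,j} x_i^T x_j, scaled to max entry 1."""
    G = X @ X.T
    c = -G.min()
    K = (G + c) ** m
    return K / K.max()               # assumed scaling: divide by the largest entry

def gaussian_kernel(X):
    """K(x, y) = exp(-||x - y||^2 / h), h = first quartile of squared distances."""
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    D2 = np.maximum(D2, 0.0)         # clip tiny negative values from rounding
    iu = np.triu_indices_from(D2, k=1)   # distinct pairs i < j (assumption)
    h = np.quantile(D2[iu], 0.25)
    return np.exp(-D2 / h), h

rng = np.random.default_rng(7)
X = rng.normal(size=(25, 5))
K_lin = linear_kernel(X)
K_poly = polynomial_kernel(X)
K_gauss, h = gaussian_kernel(X)
```

When predicting at new points, the same c, scaling constant and h estimated on the training data would presumably be reused to form the test kernel matrix K_t.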

To see the effect of the tuning parameter α, Figure 2 presents a scatter plot of Y against X₁ for Setting 2, overlaid with the fitted Y values (magenta squares) from the Gaussian kCR with α = 0, 0.5, and 1. We set the number of latent variables to k = 1. The top, middle, and bottom panels show kCR with α = 0, α = 0.5, and α = 1, respectively. The fit in the top panel is very wiggly because it follows sampling noise closely, whereas the α = 1 fit does not capture the relationship with Y well because the latent variable is extracted to explain most of the variability in the X data.

Figure 2. Fitting results for Setting 2: fitted Y against X₁ with α = 0 (top), 0.5 (middle), and 1 (bottom) Gaussian kCR. The kCR with small α captures the relationship with Y.

Figures 3 and 4 show the boxplots of the test errors for kCR over 50 replications. For Setting 1 (top panel in Figure 3), the linear kernel gives better performance. This is expected since the underlying relationship is linear. In Setting 2, all three kernels show similar performance. In both cases considered, the family of kCR does not vastly improve the performance from the individual regression fit (either α = 0, 0.5 or 1), possibly because the nonlinearity is relatively mild. In Settings 3 and 4, the Gaussian kernel is the best, followed by the polynomial kernel, and the linear kernel shows the worst performance. As the true relationship is highly nonlinear, using a model that accounts for nonlinearity greatly improves the performance. Results from the two nonlinear kernels, Polynomial and Gaussian, are comparable.

Figure 3. Box plots of test errors for the tuned kCRs over 50 replications. Top: Setting 1. Bottom: Setting 2.

Figure 4. Box plots of test errors for the tuned kCRs over 50 replications. Top: Setting 3. Bottom: Setting 4.

Table 1 reports the average selected tuning parameters for kCR over 50 replications. The standard errors are reported in parentheses. The linear kernel tends to find a smaller number of latent variables than the Gaussian kernel does.

Table 1.

Average selected tuning parameters (α, k) over 50 replications. Standard deviations are reported in parentheses.

	Linear		Polynomial		Gaussian
	α	k	α	k	α	k
Setting 1	0.53 (0.24)	2.38 (1.05)	0.70 (0.24)	2.98 (1.29)	0.38 (0.15)	3.18 (1.42)
Setting 2	0.90 (0.18)	2.04 (1.32)	0.98 (0.04)	1.94 (1.00)	0.88 (0.17)	2.80 (1.31)
Setting 3	0.64 (0.27)	2.38 (1.09)	0.44 (0.30)	2.84 (1.74)	0.35 (0.13)	3.54 (1.01)
Setting 4	0.63 (0.26)	2.34 (1.17)	0.54 (0.26)	3.28 (1.51)	0.27 (0.15)	3.32 (1.41)

For the remainder of this section, to evaluate the computational cost of the algorithm developed in Section 3, we consider different combinations of (n, d) and compare the computation time. We use Setting 2 to create the data set. In this setup, the first 4 factors relate to Y in nonlinear fashions whereas the other factors have a linear relationship with Y. The dimension of X is set to d = 7, 14, 28, 56, 112, 224, 448, and three sample sizes are considered, n = 50, 100 and 200. We used the tic/toc functions in MATLAB to measure the computation time in seconds; the average time over 5 repeats is reported in Figure 5. The algorithm requires an eigen-decomposition of the n × n kernel matrix K. For this reason, the computational time increases with n, but it does not increase rapidly with the dimension d.

Figure 5. Plot of running time (in seconds on y-axis) versus dimensionality (d on x-axis) and the sample size n. The symbols *, ○ and · are for n = 50, 100 and 200 respectively.

4.2. Real Data Analysis

In this section, we examine the performance of the proposed kCR using two publicly available data sets: diabetes (available at http://www.stanford.edu/~Hastie/Papers/LARS/) and Boston housing (available at http://archive.ics.uci.edu/ml/datasets/Housing). In the diabetes data set, 10 clinical variables (age, BMI, serum measurements, etc.) as well as a quantitative measurement of the progress of the disease (the response variable) were collected from n = 442 diabetes patients. A detailed explanation can be found in Efron et al. (2004). The Boston housing data set contains n = 506 census tracts of Boston from the 1970 census, with 12 continuous covariates and 1 binary covariate (per capita crime rate by town, average number of rooms per dwelling, full-value property-tax rate, etc.). The quantitative response variable is the median house value in USD.

We randomly split each data set into a training set and a test set containing approximately 2/3 and 1/3 of the data, respectively. All methods were then tuned by 10-fold CV within the training set. Once the tuning parameters were selected, the tuned methods were fit on the training set and evaluated on the test set. Figures 6 and 7 show box plots of the test errors over 50 random splits. For the diabetes data, all methods perform similarly, with the linear methods giving the best overall performance; kCR nevertheless gives comparable performance, despite the use of a much larger functional space through kernel learning. For the Boston housing data, kCR methods yield much better performance than linear methods. This indicates that the underlying relationship is likely to be nonlinear. Our finding matches previous analyses of this data set in the literature; see, for example, Zhang et al. (2011).

Figure 6. Diabetes data set: Box plots of test errors for the tuned kCRs over 50 replications.

Figure 7. Boston Housing data set: Box plots of test errors for the tuned kCRs over 50 replications.

5. Discussion

Regression analysis is one of the most fundamental statistical tools. CR offers an attractive framework connecting many linear regression techniques in a spectrum. In this paper, we extend linear CR into a much more flexible kernel setting. Our proposed kCR covers the linear CR as a special case. Furthermore, the proposed kCR includes the kernel PLS and kernel PCR as members as well. An efficient algorithm is developed for fast computation. Our numerical examples demonstrate the usefulness of our proposed methodology.

Our current kCR does not perform variable selection. In many applications, variable selection can be useful to obtain accurate and interpretable models. To achieve that goal, one may consider more flexible forms of kernel functions to allow the method to automatically remove variables. Existing literature in nonlinear variable selection, for example, the COSSO (Lin and Zhang, 2007) and KNIFE (Allen, 2013), can be useful here. In that case, the corresponding computational algorithm can be much more challenging. Further investigation will be pursued.

Another line of extension is to robustify kCR. The sample covariance and the sample variance used to extract features in the first construction step may not be robust in the presence of outliers, which may in turn affect the estimation of the regression coefficients in the second step. For the linear continuum regression case, a robust version was introduced by Serneels et al. (2005). A robust counterpart for kCR would be useful for practical problems.

Acknowledgements

The authors would like to thank the editor, the associate editor, and three anonymous reviewers for their constructive comments and suggestions. Yufeng Liu’s research was partially supported by the NIH grant NIH/NCI R01 CA-149569.

Footnotes

^✩ This article has supplementary material online.

Contributor Information

Myung Hee Lee, Department of Statistics, Colorado State University, Fort Collins, CO 80525, U.S.A.

Yufeng Liu, Department of Statistics and Operations Research, Carolina Center for Genome Sciences, Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599.

References

Allen GI. Automatic feature selection via weighted kernels and regularization. Journal of Computational and Graphical Statistics. 2013. To appear.
Björkström A, Sundberg R. A generalized view on continuum regression. Scandinavian Journal of Statistics. 1999;26(1):17–30.
Chen X, Cook D. Some insights into continuum regression and its asymptotic properties. Biometrika. 2010;97(4):985–989.
Cristianini N, Shawe-Taylor J. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press; 2000.
Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. The Annals of Statistics. 2004;32(2):407–499.
Frank I, Friedman J. A statistical view of some chemometrics regression tools. Technometrics. 1993;35(2):109–135.
Helland IS. Partial least squares regression and statistical models. Scandinavian Journal of Statistics. 1990;17:97–114.
Kimeldorf G, Wahba G. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications. 1971;33(1):82–95.
Lin Y, Zhang HH. Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics. 2007;34(5):2272–2297.
Mika S, Schölkopf B, Smola AJ, Müller K-R, Scholz M, Rätsch G. Kernel PCA and de-noising in feature spaces. Advances in Neural Information Processing Systems. 1999;11:536–542.
Naes T, Martens H. Comparison of prediction methods for multi-collinear data. Communications in Statistics - Simulation and Computation. 1985;14(3):545–576.
Rosipal R, Girolami M, Trejo LJ. Kernel PCA for feature extraction and de-noising in non-linear regression. Neural Computing and Applications. 2000a;10:231–243.
Rosipal R, Trejo LJ. Kernel partial least squares regression in reproducing kernel Hilbert space. Journal of Machine Learning Research. 2001;2(2):97–123.
Rosipal R, Trejo LJ, Cichocki A. Kernel principal component regression with EM approach to nonlinear principal components extraction. Tech. rep. 2000b.
Schölkopf B, Smola A, Müller K-R. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation. 1998;10:1299–1319.
Serneels S, Filzmoser P, Croux C, Van Espen PJ. Robust continuum regression. Chemometrics and Intelligent Laboratory Systems. 2005;76:197–204.
Shawe-Taylor J, Cristianini N. Kernel Methods for Pattern Analysis. Cambridge University Press; 2004.
Stone M, Brooks RJ. Continuum regression: cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares and principal components regression. Journal of the Royal Statistical Society. Series B (Methodological). 1990;52(2):237–269. With discussion and a reply by the authors.
Sundberg R. Continuum regression and ridge regression. Journal of the Royal Statistical Society. Series B (Methodological). 1993;55(3):653–659.
Vapnik VN. Statistical Learning Theory. Wiley-Interscience; 1998.
Walczak B, Massart D. The radial basis functions - partial least squares approach as a flexible non-linear regression technique. Analytica Chimica Acta. 1996;331:177–185.
Wold S. Soft modeling by latent variables; the nonlinear iterative partial least squares approach. Perspectives in Probability and Statistics. Papers in Honour of M. S. Bartlett. 1975:520–540.
Zhang HH, Cheng G, Liu Y. Linear or nonlinear? Automatic structure discovery for partially linear models. Journal of the American Statistical Association. 2011;106:1099–1112. doi: 10.1198/jasa.2011.tm10281.
