Abstract
Tikhonov regularization was proposed for multivariate calibration by Andries and Kalivas [1]. We use this framework for modeling the statistical association between spectroscopy data and a scalar outcome. In both the calibration and regression settings this regularization process has advantages over methods of spectral pre-processing and dimension-reduction approaches such as feature extraction or principal component regression. We propose an extension of this penalized regression framework by adaptively refining the penalty term to optimally focus the regularization process. We illustrate the approach using simulated spectra and compare it with other penalized regression models and with a two-step method that first pre-processes the spectra then fits a dimension-reduced model using the processed data. The methods are also applied to magnetic resonance spectroscopy data to identify brain metabolites that are associated with cognitive function.
Keywords: penalized regression, Tikhonov regularization, calibration, adaptive penalty, generalized singular value decomposition
1. Introduction
This paper provides a statistical perspective on regression models for associating a vector of spectroscopy data, x, with a scalar outcome, y, such as a patient's disease status or phenotype. In the case of a continuous outcome, a multiple linear regression model is of the form y = β⊤x + ε, where ε ~ N(0, σ2) denotes the random error corresponding to y − β⊤x; i.e., E(y) = β⊤x, where β is the (unknown) p × 1 coefficient vector whose kth coordinate models the association of the corresponding coordinate in x (k = 1, …, p) with the observed outcome y. Or, in matrix notation, this takes the form y = Xβ + ε, where X denotes an n × p matrix whose ith row is the p-vector of spectroscopy measures, y is the n × 1 vector of observed outcomes, and ε the vector of model errors. If, for example, a small set of p < n analyte features has been extracted, then the ordinary least-squares (OLS) solution is the minimum variance unbiased estimate for the linear association between y and x. In this paper we describe the estimation of β via a process that does not first undertake a wavelength- or variable-selection step, instead forming a model using the full spectra in which, typically, p ≫ n and the linear system hence has infinitely many solutions.
The topic of regularizing ill-posed linear systems has a long history and is well studied in a variety of contexts, including image analysis [3], statistics [26, 13], mathematics [25, 11, 9], and calibration [5, 15], to name a few. Our approach is very similar in spirit to calibration methods of Andries and Kalivas [1] (hereafter referred to as A&K) but with several important conceptual differences. In particular, rather than focusing on spectral calibration we formulate a statistical framework aimed at estimating the potential association between a set of subject-specific outcomes (something external to the collection of spectra, such as disease status) with the cohort's set of spectroscopy data. There may, in fact, not be any association (the null hypothesis) but in the event that there is, it is useful to further identify which spectral properties (wavelengths or analyte features) are associated. That is, our primary goal is the estimation of β rather than the prediction of y. With this in mind, we have taken care to establish notation that distinguishes between a statistical multiple regression model and the conceptually similar problem of multivariate calibration. In the latter, y denotes a set of known, non-random analyte concentrations corresponding to calibration spectra X. Using notation analogous to the above, this may be represented in matrix form as Xb = y + e, where the aim is to estimate a model vector, b, which may then be used to predict analyte concentrations in future samples. In this setting, e represents a vector of unmodeled residuals where the source of error comes from X (as opposed to y, above). We note that in either setting, the minimum-norm least-squares estimate (of b or β) can be obtained from among the infinitely many solutions [11]. This solution, however, can be highly unstable and even useless in a statistical setting when X⊤X is singular or ill-conditioned.
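The minimum-norm least-squares solution and its behavior when p ≫ n can be illustrated in a few lines. The sketch below (NumPy, with simulated Gaussian data; all names and dimensions are illustrative, not from the paper) computes the minimum-norm solution via the Moore-Penrose pseudoinverse and confirms that it fits the data exactly while having the smallest norm among the infinitely many solutions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                       # p >> n: infinitely many least-squares solutions
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0                  # a sparse "true" coefficient vector (illustrative)
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Minimum-norm least-squares solution via the Moore-Penrose pseudoinverse
beta_mn = np.linalg.pinv(X) @ y

# Any vector in the null space of X can be added without changing the fit,
# but doing so can only increase the Euclidean norm of the solution.
null_proj = np.eye(p) - np.linalg.pinv(X) @ X
z = null_proj @ rng.normal(size=p)
```

Because beta_mn lies in the row space of X and z in its null space, the two are orthogonal, so ∥beta_mn + z∥ ≥ ∥beta_mn∥ for every null-space perturbation z.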
In summary, our presentation may be viewed as an offshoot of A&K in several ways. First, rather than approximating a model vector, b, in a calibration setting we focus on estimation of a coefficient function, β, in a statistical regression model for a biomedical study (Section 2). Second, although we perform the estimation of β within the same Tikhonov regularization framework used by A&K, our penalty term serves to focus the estimation process toward a "favored subspace" of relevant signal rather than filtering out signal in the "interferent space" (Section 3.2). Finally, we propose a novel approach to the estimation of β by adaptively refining the favored space.
Conceptually, the favored space contains the "true" β and is penalized lightly while its orthogonal complement is penalized heavily. Note, however, that if the favored space is large and underpenalized, the variance of the estimate will be too large, and so we develop the concept of an adaptive penalty to account for the fact that one cannot know, a priori, the precise subspace of features in x that are associated with y. Therefore, we describe an estimation process that begins with a large favored space and iteratively pares down the size of this space, converging to a more stable estimate.
In order to evaluate the potential reduction in variance by this proposed process, Section 2 reviews some basic concepts on variation in regression. Further, because there is some discord among the way terminology is used in calibration versus statistical regression, we provide a glossary of how several potentially conflicting terms are used. In Section 3, we review the concept of Tikhonov regularization as used in A&K and in our methods. Section 4 describes the details of our adaptive penalization approach while Section 5 provides the results of several simulations designed to illustrate the properties of this approach. Also included is an application of this method to a clinical study involving magnetic resonance spectroscopy data and their potential association with cognitive decline in HIV patients.
The following notation is used. Lower- and upper-case boldface letters are used to denote a vector (a) or matrix (A). Non-boldface letters denote a scalar or scalar-valued random variable, y. The transpose, inverse and generalized inverse of a matrix A are denoted by A⊤, A−1 and A†, respectively. In is the n × n identity matrix and ∥a∥ denotes the Euclidean 2-norm of a.
2. Calibration versus regression models
Multivariate calibration involves spectral processing for the removal of non-analyte signal and quantification of analyte concentrations. Multiple regression in a biomedical setting ideally begins with well-calibrated measurements — e.g., from metabolites quantified in magnetic resonance spectroscopy (MRS) data — and aims to infer potential associations of analyte concentrations with a clinically relevant outcome. For this, raw spectra containing a mix of analyte and interferent signal may be pre-processed with the goal of removing interferent signal, and the analyte concentrations calibrated for subsequent use as covariates in a multiple regression model (see, e.g., Provencher [20]). Our approach is different: rather than pre-processing the spectra to remove noise, the estimation of β is performed within a penalized regression framework [18] that focuses the estimation of β away from non-analyte signal and toward analyte signal. For this, pure analyte spectra (forming the rows of a new matrix Q) are used in conjunction with the observed spectra in X to construct the regression estimate in such a way that the associations between x and y (as represented by non-zero coefficients of β̂) are informed by both x and the pure metabolite spectra. As described here (and noted by A&K), this use of both X and Q can be made explicit using the generalized singular value decomposition (GSVD); see Section 3. In the remainder of this section we establish terminology on quantifying uncertainty in estimating a regression coefficient vector, β [18].
2.1. Variance and bias in linear regression
While calibration models often focus on approximating the model vector, b, a statistical linear regression model aims to estimate a coefficient vector, β, as well as the conditional mean and conditional variance of the response, y, given the covariates, X. For example, the variability of the response in OLS is modeled by a separate parameter which is independent of the covariates and homogeneous across observations. Specifically, E(y|X) = Xβ and V ≡ Var(y|X) = σ2In, and so the goal is to estimate both the regression vector, β, and the variance of the response, σ2. One may estimate β first and then estimate σ2 under the assumption that the variance does not depend on X. This variance is essential for subsequent inference regarding β̂. For example, when n ≫ p, the OLS estimate β̂OLS = (X⊤X)−1X⊤y is not only unbiased, E(β̂OLS) = β, but it has the smallest variance among all unbiased estimators, Var(β̂OLS) = σ2(X⊤X)−1. More generally, when the observations yi are not independent or have different variances, i.e., Var(y|X) = V ≠ σ2In, then β̂OLS no longer has the smallest variance, although it is still unbiased. In this case, the weighted least-squares estimate β̂WLS = (X⊤V−1X)−1X⊤V−1y is unbiased and has the smallest variance among all unbiased estimators, hence may be preferred over OLS.
The variance of β̂ has both statistical and computational relevance. In situations for which X⊤X is ill-conditioned or singular (e.g., p > n), the estimate β̂OLS is, respectively, highly variable or undefined. This computational instability is a reflection of large variance, Var(β̂OLS) = σ2(X⊤X)−1, which is illustrated in the examples of Section 5. A more stable, but biased, estimate may be preferred and ridge regression achieves this: β̂ridge = (X⊤X + λI)−1X⊤y. The tuning parameter λ controls the balance between the bias, E(β̂ridge) − β = (X♯X − I)β, and the variance, Var(β̂ridge) = σ2X♯X♯⊤, respectively, where X♯ = (X⊤X + λI)−1X⊤. This tradeoff appears in the mean square error (MSE) of the estimate: MSE(β̂ridge) = ∥(X♯X − I)β∥2 + σ2 tr(X♯X♯⊤). Ridge regression will reduce the variance while increasing the bias. In contrast, while the penalty learning process proposed in Section 4 also aims to reduce the variance, it aims to do so with a smaller increase in bias. In our estimation process, as well as in ridge regression or the more general Tikhonov regularization discussed in the next section, the choice of tuning parameter λ is important. For this, one may use a subjective method, e.g. the trace plot [14], or a more objective method, such as minimizing cross-validation prediction error [10]. For the more general Tikhonov regularized estimate discussed in the next section, we will describe a random-effects model for obtaining an objective, model-based estimate of λ.
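The bias-variance tradeoff described above can be illustrated with a small Monte Carlo sketch. Assuming nothing beyond standard NumPy, the code builds a nearly collinear design (so X⊤X is ill-conditioned) and then estimates the squared bias and variance of a near-OLS fit (λ ≈ 0) versus a ridge fit; the particular dimensions and λ values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 50, 10, 1.0
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)   # near-collinear columns -> ill-conditioned X'X
beta = np.ones(p)

def ridge(X, y, lam):
    """Ridge estimate (X'X + lam I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mc_bias_var(lam, reps=500):
    """Monte Carlo estimate of total variance and squared bias of the estimator."""
    est = np.array([ridge(X, X @ beta + sigma * rng.normal(size=n), lam)
                    for _ in range(reps)])
    var = est.var(axis=0).sum()
    bias2 = ((est.mean(axis=0) - beta) ** 2).sum()
    return bias2, var

b0, v0 = mc_bias_var(1e-8)   # essentially OLS: huge variance along the collinear direction
b1, v1 = mc_bias_var(10.0)   # ridge: a little bias traded for a large drop in variance
```

With this design the OLS variance is dominated by the nearly collinear direction, so the ridge fit has a much smaller variance and a smaller total MSE (squared bias plus variance).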
Finally, for evaluating the performance of our approach in the simulations, we are interested in the pointwise squared error, (β̂k − βk)2. This can be summarized (pointwise) as MSE(β̂k) = E(β̂k − βk)2 = Bias2(β̂k) + Var(β̂k), which measures the accuracy and the precision of β̂ at each coordinate. A vector summary of this is the "integrated" squared error, ∥β̂ − β∥2. And, viewing β̂ as a random vector, the mean of this is E∥β̂ − β∥2, i.e., the MSE(β̂), as defined above. In the interest of a robust summary in the simulations, we will also consider the median and the interquartile range of ∥β̂ − β∥2; see, e.g., [27, 7].
2.2. Terminology in calibration and regression
We have emphasized the role of variation in β̂ because it is inherent in the estimation process for the statistical regression model y = Xβ + ε, in which y is a sample-specific trait that is assumed to be measured with error. In contrast, a calibration model Xb = y + e assumes y is a known physical property that is quantified without error. In particular, these models focus on different sources of error: in a calibration model X is random while in a (frequentist) regression model y is random and the measurements in X (the "variables") are typically assumed to be measured without error. Consequently, when an estimate β̂ is used to infer which measurements in X are associated with y, the uncertainty in this estimation process is used to define statistical significance.
The mathematical framework for these two problems, however, is the same. The work of A&K, and references therein, provides valuable insight and mathematical clarity on the various perspectives and approaches used for calibration models. For additional perspective, see [16]. One of our goals is to provide parallel insight and options for the context of estimation in a statistical regression model. Because research on calibration and statistical regression often operates in parallel with limited crossover in the literature, the emphases and terminologies sometimes differ. Therefore, in order to facilitate communication it is useful to make explicit the definitions and concepts as they are used in this paper. This is done in Table 1.
Table 1:
Summary of terms used for calibration versus regression modeling.
| terminology | calibration model | statistical regression model |
|---|---|---|
| model | Xb = y + e | y = Xβ + ε |
| response | y, a known physical property, quantified without error | y, a quantified trait, observed with error |
| estimate | b̂, a model vector; a non-random approximation of b | β̂, a regression coefficient vector; a random vector estimated with uncertainty |
| error | e, model approximation error | ε, random error in y |
| estimation variance | — | Var(β̂) |
| bias | — | E(β̂) − β |
| mean squared error of estimation (MSE) | — | E∥β̂ − β∥2 |
| prediction | ŷ = Xb̂ | ŷ = Xβ̂ |
| mean squared error of prediction | E∥ŷ − y∥2 | E∥ŷ − y∥2 |
| prediction variance | proportional to ∥b̂∥2 | Var(ŷ) |
| interferent | non-analyte signal or spectral noise | measurement noise |
| inference | — | testing regions for which β = 0 |
3. Regularization methods in calibration and regression
We briefly review concepts specific to multivariate calibration, as discussed by A&K, who provide an excellent mathematical unification of various spectral processing methods: generalized net-analyte signal (GNAS), generalized least-squares (GLS) and generalized Tikhonov regularization (GTR). A&K highlight the difference between the concepts of "spectral pre-processing" versus "inverse processing". Spectral pre-processing refers to two-step methods in which the spectra in X are first pre-processed via post-multiplication, XP, and then the linear system is solved. In contrast, inverse processing methods allow for concurrent spectral processing and model building.
More specifically, if P = I − L†L, where L is a matrix of interferent spectra — non-signal output corresponding to spectral interference — then XP is the projection of the calibration spectra onto the orthogonal complement of the interferent space. As shown by A&K, the methods of NAS, GNAS and GLS are distinguished simply by how this projection is weighted. A&K then proceed to define spectral processing via augmentation where the interferent operator, L, serves the role of penalty operator in GTR:
b̂ = arg minb {∥Xb − y∥2 + λ∥Lb∥2}.   (1)
In particular, the filtering out of interferent signal is performed concurrently with the formation of the model vector via the GTR model (1). The terminology "inverse modeling" arises from the property that the model vector comes from a closed-form inverse expression of the form b̂ = Xgy, for some generalized inverse Xg. A&K have shown how this is true for each of GNAS, GLS and GTR. GTR is of particular interest for us as one can show that the expression in (1) is equivalent to b̂ = (X⊤X + λL⊤L)−1X⊤y; see, e.g., [11].
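The equivalence between the penalized criterion (1) and the closed-form expression (X⊤X + λL⊤L)−1X⊤y can be verified numerically by solving the augmented least-squares system that stacks √λ·L beneath X and zeros beneath y. The sketch below does this in NumPy; the second-difference penalty used for L is only one convenient, invertible-normal-equations choice.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 30, 15, 2.0
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
L = np.diff(np.eye(p), n=2, axis=0)   # (p-2) x p second-difference penalty operator

# Closed-form Tikhonov solution (X'X + lam L'L)^{-1} X'y
b_closed = np.linalg.solve(X.T @ X + lam * L.T @ L, X.T @ y)

# Same solution via ordinary least squares on the augmented system
#   [ X            ]       [ y ]
#   [ sqrt(lam) L  ]  b ~  [ 0 ]
X_aug = np.vstack([X, np.sqrt(lam) * L])
y_aug = np.concatenate([y, np.zeros(L.shape[0])])
b_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
```

The normal equations of the augmented system are exactly (X⊤X + λL⊤L)b = X⊤y, so the two solutions coincide.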
3.1. The GSVD in Tikhonov regularization
As noted by A&K, there is a rigorous mathematical underpinning for this provided by the generalized singular value decomposition (GSVD). Recall, first, that a ridge estimate (L = I) can be expressed explicitly in terms of the singular vectors of X as β̂ridge = Σk=1,…,n [σk/(σk2 + λ)] (uk⊤y) vk, where σk is the k-th largest singular value of X, and uk and vk are the corresponding left and right singular vectors, respectively, assuming rank(X) = n. Now, given X and L ≠ I, the GSVD assures a simultaneous diagonalization of these two matrices as
X = USW−1,  L = VMW−1,   (2)
where U and V have orthonormal columns, S and M are diagonal and W is nonsingular. By convention, the generalized singular values are ordered as 0 ≤ σ1 ≤ σ2 ≤ … ≤ σr ≤ 1 and 1 ≥ μ1 ≥ μ2 ≥ … ≥ μr ≥ 0, where σk2 + μk2 = 1; r depends on the rank of X and L. Denote the columns of U, V and W by uk, vk and wk, respectively. In this convention for ordering the GS values and vectors, the last few columns of W span the subspace Null(L). If Null(L) ≠ {0}, these correspond to the smallest generalized singular values, μk. We set d = dim(Null(L)) and note that μk = 0 for k > n − d. See [11] or [4] for details.
From now on we will focus on the estimation of β in the regression model y = Xβ + ε, although all the properties discussed above apply to the mathematically equivalent process of approximating a calibration model vector b in (1). The L-penalized estimate can be expressed as a series in terms of the GSVD as
β̂ = Σk=1,…,r [σk/(σk2 + λμk2)] (uk⊤y) wk.   (3)
This is an expansion with respect to the generalized singular vectors which correspond to a new basis for the estimation process: the estimate is expressed in terms of the generalized singular vectors {wk} determined jointly by X and L. This property motivated the terminology of “partially empirical eigenvectors for regression” (PEER) [22] in order to distinguish it from the purely empirical (from X) eigenvectors that make up the regression estimates from ordinary least squares, partial least squares, ridge and principal component regression. We emphasize that PEER is mathematically identical to GTR, but the two contexts in which they are applied—calibration versus regression—are actually addressing different questions and so we will adopt both terminologies here in order to distinguish between the two settings.
In Section 4 we introduce an adaptive version of this regularization process. For brevity, we will refer to this proposed method for obtaining a refined penalty L and its corresponding estimate as "supervised PEER", or SuPEER.
3.2. Two perspectives on penalized regularization
A&K discuss general approaches to removing interferent contributions from spectra and focus on the theoretical relationships between GNAS processing, GLS and GTR. As noted, the proposal for GTR is mathematically equivalent to our statistical modeling approach, but in addition to the different contexts of calibration versus regression, the premise is different: GTR aims to filter out non-analyte or interferent signal with a penalty term whereas we focus the estimation process toward analyte-specific signal. Moreover, A&K work in the context of calibration where y is known (presumably without error) and b is viewed as a deterministic vector whose true structure is inherent in X, albeit contaminated with some interferent structure. In particular, b is presumed to be a non-zero vector. Our goal, on the other hand, is ultimately to infer which properties of spectroscopy signal, x, are associated with the outcome, y. The difference between these two settings is highlighted by the fact that there may be no association; indeed, our null hypothesis is β ≡ 0.
In spirit, our goal is much the same as in A&K, but it is accomplished from a reverse perspective. To quote A&K, the constraint Lb = 0 attempts to immunize the model vector against noise (interferent) by orthogonally pointing away from the interferent space. We adopt the alternative perspective that the penalty term serves to focus the estimation process in a manner that encourages β̂ to be in or near a "favored" subspace, Q. This, and a variety of perspectives on regularization, can be found in the references cited in the introduction.
In general, the least-dominant eigenvectors of a penalty operator L have the largest effect on the estimate β̂. This observation can be used to construct a "favored subspace" by defining a penalty L whose least-dominant eigenvectors (those corresponding to the smallest eigenvalues) are most relevant to the estimation process. One possible approach is as follows: (1) identify a subspace, Q, where β is likely to belong and treat this as a favored space; (2) define a decomposition-based penalty, L, that penalizes more heavily when the estimate falls into the orthogonal complement, Q⊥, of this favored space.
For intuition about how a penalty operator can be used in this way, it is useful to recall the familiar example of a Laplacian, or second-derivative, penalty L = D2. One heuristic for this penalty arises from viewing β as a function whose local "smoothness" is presumed. In this case, the term ∥D2β∥2 penalizes sharp changes in β̂. Alternatively, recall that the dominant eigenvectors of D2 are sharply oscillatory while the least dominant eigenvectors are very smooth. Hence, a linear-algebraic view of this is that rather than penalizing sharp changes, smoothness in the estimate is inherited from the eigenproperties of D2. Specifically, structure in β̂ arises from the joint eigenproperties of X and D2. This heuristic is formalized by the GSVD representation in (2)–(3).
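The eigen-structure heuristic for D2 is easy to verify directly: constants and linear trends lie in the null space of the second-difference operator (and so are unpenalized), while sharply oscillating vectors are penalized heavily. A minimal NumPy check:

```python
import numpy as np

p = 50
D2 = np.diff(np.eye(p), n=2, axis=0)      # (p-2) x p second-difference operator
s = np.linspace(0, 1, p)

# Constant and linear vectors are annihilated by D2 (unpenalized directions)
flat_const = D2 @ np.ones(p)
flat_linear = D2 @ s

# A sharply alternating vector is penalized far more heavily than a smooth one
wiggle = np.cos(np.pi * np.arange(p))     # alternating +1/-1
smooth = np.sin(np.pi * s)                # one slow oscillation
pen_wiggle = np.linalg.norm(D2 @ wiggle)
pen_smooth = np.linalg.norm(D2 @ smooth)
```

Here pen_wiggle is several orders of magnitude larger than pen_smooth, which is the sense in which smoothness is "inherited" from the penalty's eigenstructure.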
4. Favored spaces and adaptive penalties
As described in Section 2.1, an important motivation for adding a penalty term to a least-squares loss function is to stabilize or reduce the variance of the estimation process. A ridge penalty imposes no structure and simply shrinks all coordinates of β̂ equally, producing a biased but more stable estimate of β. More generally, a user-defined penalty in terms of L in (3) can be employed to guide the estimation process toward a subspace, Q, constructed to be orthogonal to spectral structure that is irrelevant to the problem. We refer to such a subspace as "favored" based on the idea that it should contain the "true" β. Using prior knowledge and/or data to inform the construction of Q, one can construct a penalty operator L as follows. Let PQ denote the orthogonal projection onto Q. If the true regression vector β resides in or near Q then the orthogonal complement, Q⊥, should be more heavily penalized than Q. Therefore, we also consider the projection, I − PQ, onto the orthogonal complement, Q⊥. For practical reasons, an invertible penalty operator is often preferred, so we define
L(a, Q) = a(I − PQ) + (1 − a)PQ,   (4)
for some 1/2 ≤ a ≤ 1. This can be viewed as a weighted average of two orthogonal projections. Note that when a = 1/2, L(1/2, Q) = I/2 is simply a ridge penalty. On the other hand, when a = 1, L(1, Q) = I − PQ penalizes only the orthogonal complement of the favored space. In this case, if β ∈ Q, then Lβ = 0 and no penalty is applied to the true β. Hence an estimate derived from using L(1, Q) in (3) is unbiased. However, if Q is too large, such an unbiased estimate will have a large variance, making the estimate useless in practice. For example, in the most extreme case of Q = ℝp (with a = 1), we have L = 0, which penalizes nothing and, when p > n, the OLS estimate is undefined. For this reason, we may wish to initially cast a wide net by defining Q to be large, then iteratively refine this favored space so as to more parsimoniously represent the preferred structure.
4.1. Penalty learning
To implement this type of penalty, we begin with a large collection of pure analyte spectra or some other informed, pre-collected set denoted by S(0) = {q1, …, qd}, where the qj are linearly independent p-dimensional vectors. Let Qd be the d × p matrix whose rows are formed by the vectors in S(0). Then the initial favored space is d-dimensional and Qd = span S(0). By a slight abuse of notation, we will use Qd for both the matrix and the favored space. Then an orthogonal projection onto Qd can be defined as
PQd = Qd†Qd = Qd⊤(QdQd⊤)−1Qd.   (5)
Here, Qd† = Qd⊤(QdQd⊤)−1 is the Moore-Penrose inverse, since the rows of Qd are linearly independent. An important special case is when S(0) contains orthonormal qj. Then QdQd⊤ = Id and the projection simplifies to PQd = Qd⊤Qd. Replacing Q by Qd in (4), we denote the initial penalty term by L(a, Qd). When 1/2 ≤ a < 1, L(a, Qd) is invertible with
L(a, Qd)−1 = a−1(I − PQd) + (1 − a)−1PQd.   (6)
Furthermore, the initial estimate for β using (3) is
β̂(0) = [X⊤X + λL(a, Qd)⊤L(a, Qd)]−1X⊤y.   (7)
To simplify notation, we suppress the parameters a, Qd and λ in β̂(0) when the context is clear.
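A minimal sketch of the constructions in (4)–(7), using random stand-ins for the pure analyte spectra, might look as follows; the values of a, λ and the dimensions are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, d, a, lam = 30, 20, 4, 0.9, 1.0
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
Q = rng.normal(size=(d, p))          # rows q_j: stand-ins for d pure analyte spectra

# Orthogonal projection onto the row space of Q, cf. (5)
P = Q.T @ np.linalg.solve(Q @ Q.T, Q)
I = np.eye(p)

# Penalty operator of (4) and its closed-form inverse, cf. (6)
L = a * (I - P) + (1 - a) * P
L_inv = (1 / a) * (I - P) + (1 / (1 - a)) * P

# Initial penalized estimate, cf. (7)
beta0 = np.linalg.solve(X.T @ X + lam * L.T @ L, X.T @ y)
```

Because I − P and P are orthogonal projections, L is symmetric with eigenvalues a and 1 − a, which is why the inverse in (6) simply swaps the weights for their reciprocals.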
Next, we examine the favored space for similarity between the vectors qj and the regression coefficient estimate β̂(0). Define
δj = qj⊤β̂(0),   (8)
where σj = ∥Xqj∥/√n. Now we define a similarity coefficient between β̂(0) and each qj in S(0) as
ρj = |δj|/σj,   (9)
and use it to sort the qj in decreasing order. Denote the ordered vectors by q(j), where q(j) has the j-th largest ρ. To trim the initial favored space and improve the penalty, we remove q(d), the vector with the smallest ρ, and denote the set of remaining vectors by S(1) = {q(1), …, q(d−1)}. Heuristically, if a qj is orthogonal to β̂(0) then β is less likely to reside in a linear space that contains that qj, and one should exclude it from Qd. Because the absolute value of δj is affected by the scale of the data in the j-th direction, we normalize it by the standard deviation of X in the direction of qj, which is quantified by σj.
With the remaining d − 1 vectors in S(1), we refine the favored space to Qd−1 = span S(1) and use this trimmed space to reconstruct the penalty L(a, Qd−1) via (4) and (5). Then, we update the estimate of β using the new penalty L(a, Qd−1) as in (7), and denote it by β̂(1). Using (8)–(9), we recalculate the similarity coefficients between β̂(1) and the qj in S(1), for j = 1, …, d − 1. The q vectors in S(1) can then be reordered based on the updated ρ values. Removing the qj with the smallest ρ, we have S(2) and Qd−2 = span S(2), a further refined favored space.
Iterating the above procedure and trimming off one q vector at a time, we obtain a nested sequence of favored spaces, Qd ⊃ Qd−1 ⊃ ⋯ ⊃ Q0, which produces a sequence of adaptive penalties L(a, Qd−k). Consequently, we can update the adaptively penalized regression coefficient estimates as
β̂(k) = [X⊤X + λL(a, Qd−k)⊤L(a, Qd−k)]−1X⊤y,   (10)
for k = 0, … , d. Here the tuning parameter a, which balances the fidelity to the observed data and the loyalty to the subjective prior information in q vectors, can be chosen subjectively and may increase as the iteration proceeds. From our experience in simulations, we find little difference in estimation when a is sufficiently close to 1. Therefore, a fixed large a is used in our simulations and data analysis. The detailed stopping rule will be discussed in Section 4.3.
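The full trimming iteration can be sketched as below. Note that the similarity score used here, |qj⊤β̂| scaled by the spread of X along qj, is an illustrative stand-in for the δj/σj construction of (8)–(9), and the fixed a and λ are simplifications (the paper estimates λ by REML; see Section 4.3).

```python
import numpy as np

def supeer_path(X, y, Q, a=0.9, lam=1.0):
    """Adaptive penalty refinement sketch: fit, score each remaining q_j
    against the current estimate, drop the least-similar q, and refit.
    Returns the sequence of estimates beta^(k)."""
    n, p = X.shape
    I = np.eye(p)
    rows = list(range(Q.shape[0]))      # indices of q vectors still in the favored space
    path = []
    while True:
        Qk = Q[rows]
        P = Qk.T @ np.linalg.solve(Qk @ Qk.T, Qk)      # projection onto current space
        L = a * (I - P) + (1 - a) * P                  # penalty (4)
        beta = np.linalg.solve(X.T @ X + lam * L.T @ L, X.T @ y)
        path.append(beta)
        if len(rows) == 1:
            break
        # Illustrative similarity: |q_j' beta| normalized by the spread of X along q_j
        sims = [abs(Q[j] @ beta) / (np.linalg.norm(X @ Q[j]) / np.sqrt(n))
                for j in rows]
        rows.pop(int(np.argmin(sims)))                 # trim the least-similar q vector
    return path
```

A stopping rule (e.g., the REML-based one of Section 4.3) would then select one estimate from the returned path.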
4.2. A Bayesian interpretation of the penalty learning process
Another way to interpret the adaptive penalty and derive the penalized least-squares estimate in (10) is to view the penalty term as controlling variation of β [8, 2, 17, 19]. From a Bayesian perspective, the estimate depends not only on the data but also on the prior distribution of the parameter. Heuristically, the prior distribution may shrink the estimate toward a smaller region of a nominally large parameter space, hence reducing its variability. This is particularly useful when the unknown model parameters lie in a large sample space and the observed data do not contain sufficient information to accurately estimate the parameters. This is exactly the situation in the examples we are considering. More precisely, if we assume that
y | X, β ~ N(Xβ, σ2In),  β ~ N(0, (2λ)−1(L⊤L)−1),   (11)
then the posterior mean of β given y, E(β|y), is a Bayes estimate for β under squared-error loss. Furthermore, using the properties of the normal distribution, one can derive the posterior distribution of β and find E(β|y) = [X⊤X + λL⊤L]−1X⊤y, which is the same as the penalized least-squares solutions in (3) or (10).
In our setting, the choice of the prior distribution is equivalent to choosing L in (11). To gain insight on the influence of L on β, consider the simple case of L = I, so that β ~ N(0, (2λ)−1I). This is equivalent to a ridge penalty which assumes that all components of β are independent with variances proportional to 1/λ. A smaller λ will lead to larger variance of all components of β, and cause the estimator to depend more on the observed spectra, X, and the response, y. On the other hand, a larger λ will force β closer to its prior mean, 0. More generally, assuming equal variance for all components of β and shrinking them all toward 0 might not be the best practice, especially if we have some prior information about β.
We now incorporate an informative prior via L. Let Qd denote a favored space presumed to contain β. We decompose β = (I − PQd)β + PQdβ, where the first term is the portion of β outside of Qd and the latter portion is in Qd. Intuitively, we should assign a smaller variance to the first term and a larger variance to the second, constraining the portion of β outside of Qd more heavily. The L defined in (4) accomplishes this goal: since PQd and I − PQd are orthogonal projections, L⊤L = a2(I − PQd) + (1 − a)2PQd.
Evaluating the overall variance in Qd and Qd⊥, we obtain

Cov(β) = (2λ)−1[a−2(I − PQd) + (1 − a)−2PQd].
This implies that the average variance of β in each direction of Qd is proportional to (1 − a)−2, and in each direction of Qd⊥ to a−2. Note a−2 < (1 − a)−2 for 1/2 < a < 1. So, loosely speaking, we start with a large favored space Qd and allow a larger variance for the portion of β in the favored space, i.e., PQdβ; this allows β̂ to depend more on the data, X and y. On the other hand, for the portion of β outside of Qd, i.e., (I − PQd)β, we assign a smaller variance and shrink this more toward 0, the prior mean of β. As we iteratively trim Qd and reduce its dimension, we allow β̂ to depend on the data in a few supervised directions and restrict its variability in other directions.
4.3. REML estimation and stopping rule
The Bayesian perspective on adaptive penalties also provides a way to compute the proposed β̂(k). For an invertible L defined in (4), if we define θ := Lβ, then the model (11) becomes y | X, θ ~ N(X*θ, σ2I), θ ~ N(0, (2λ)−1I), where X* = XL−1 and L−1 is given in (6). Hence we may first estimate θ through a standard ridge regression, θ̂ = (X*⊤X* + λI)−1X*⊤y, and then transform back to β using β̂ = L−1θ̂ = a−1(I − PQd)θ̂ + (1 − a)−1PQdθ̂.
Furthermore, the regression coefficients in a ridge estimate can be obtained by using an equivalence with the linear mixed model formulation; see, for example, [23] or [19]. Such a representation facilitates a straightforward use of existing linear mixed model routines widely available in software packages (e.g., R [21] or SAS [24]). Importantly, this formulation also provides an estimate of the tuning parameter λ via restricted maximum likelihood (REML), in which λ is estimated as a ratio of the variances of the errors, ε, and the regression coefficients, β. This REML-based estimation of the tuning parameter λ is used in the application of Section 5. In addition, the estimated λ can be used as a stopping rule: stop the iteration when λ̂ is minimized.
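The θ = Lβ ridge transformation described above is easy to verify numerically: solving the transformed ridge problem and mapping back through L−1 reproduces the directly penalized estimate. The sketch below uses the closed-form inverse (6); all dimensions and tuning values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, d, a, lam = 30, 20, 4, 0.9, 2.0
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
Q = rng.normal(size=(d, p))

P = Q.T @ np.linalg.solve(Q @ Q.T, Q)
I = np.eye(p)
L = a * (I - P) + (1 - a) * P                    # penalty (4)
L_inv = (1 / a) * (I - P) + (1 / (1 - a)) * P    # closed-form inverse (6)

# Direct penalized least-squares solution
beta_direct = np.linalg.solve(X.T @ X + lam * L.T @ L, X.T @ y)

# Same estimate via the ridge transformation: theta = L beta, X* = X L^{-1}
X_star = X @ L_inv
theta_hat = np.linalg.solve(X_star.T @ X_star + lam * I, X_star.T @ y)
beta_via_ridge = L_inv @ theta_hat
```

Since L is symmetric here, substituting θ̂ back gives L−1(L−1X⊤XL−1 + λI)−1L−1X⊤y = (X⊤X + λL⊤L)−1X⊤y, so the two routes agree.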
4.4. Some properties of SuPEER
To understand the theoretical properties of SuPEER, we investigate the difference between a SuPEER estimate β̂ and the target or "true" β. Let Qd* be the final, selected favored space with dimension d* and L = L(a, Qd*). Then

β̂ − β = A−1(B − C),   (12)

where A = n−1(X⊤X + λL⊤L), B = n−1X⊤(y − Xβ) and C = (λ/n)L⊤Lβ.
Next, we will show that the difference in (12) is actually very small, provided λ and a are appropriately chosen. First, assume n−1X⊤X → Σ as n → ∞. Then, assuming that there is a constant c such that 0 < c = lim λ/n < ∞, we have

A → Σ + cL⊤L ⪰ c(1 − a)2I

for 1/2 ≤ a < 1. Hence A > 0 and has a finite inverse. (For the limiting case a = 1, A is only invertible when Null(X) ∩ Null(L) = {0}, which implies d* > p − n.) Second, for the random term B, note that

B = n−1X⊤(y − Xβ) = Op(n−1/2)

under the model E(y|X) = Xβ and Var(y|X) = σ2V. We use the notation Op(·) and op(·) to denote boundedness and convergence in probability, respectively. Finally, consider the term C = (λ/n)L⊤Lβ. Assuming that a → 1 sufficiently fast, we have C → c(I − PQd*)β as n → ∞. If the favored space is correctly identified, i.e., β ∈ Qd*, then C → 0. Putting the three terms together, we have

β̂ − β = A−1(B − C) = Op(n−1/2) − A−1C

and

β̂ − β → 0 in probability when β ∈ Qd*.
Hence the SuPEER estimate is a consistent estimator of β, which implies that the estimate approaches the true coefficient vector as the sample size increases.
5. Numerical examples
The SuPEER approach is illustrated on simulated data and using real spectroscopy data from a cohort study. In the two simulation studies, we compared it to ordinary ridge regression, generalized ridge (PEER) and a two-step estimation process in which features in spectra are first extracted and then used as regressors in an ordinary linear regression model. In the first simulation study, we assessed performance by comparing SuPEER with the two-step method by looking at the number of correct basis functions that each method chose from a predefined favored space. In the second simulation study, we compared SuPEER, PEER and ordinary ridge regression in terms of squared error of estimation, ∥β̂ − β∥2.
5.1. Simulation study
In subsection 5.1.1 we describe how we simulated the spectroscopy data and how we used various forms of a predefined β to construct a “true” association with an outcome, y. Spectroscopy curves, x, were simulated to loosely resemble noisy MRS spectra. In our regression setting, these are sometimes referred to as “predictor functions” and may be used to predict an outcome such as neurocognitive impairment in HIV patients. Our interest lies in the discovery of biomarkers (e.g., brain metabolites) associated with cognitive decline; hence, in our simulations we do not evaluate prediction performance but rather focus on the error in estimating the “true” coefficient vector, β. Subsection 5.1.2 outlines the various methods used to estimate β, and the results are compared in subsection 5.1.3.
5.1.1. Data simulation
Spectra were simulated as an affine combination of “bump functions” with varying support at a number of pre-specified locations, with Gaussian noise added to simulate measurement noise. The “true” coefficient vector was defined to correspond to some, but not all, “bumps” in these simulated spectra. Specifically, we constructed each spectrum x to consist of narrow, wide, and moderate bumps centered at: Hnarrow = {0.40, 0.90}, Hwide = {0.50}, and Hmoderate = {0.05, …, 0.95} (increments of 0.05). We generated the subject-specific spectra xi(s) to be of the form
| (13) |
where {aw,h, cw,h}, {am,h, cm,h} and {an,h, cn,h} correspond to the amplitude and degree of curvature at the wide, moderate and narrow bumps, respectively; the values are given in Table 2. For both the predictor and regression vectors, we sampled s at either p = 200 or p = 400 equispaced sampling points in the interval [0, 1]. Note that the amount of curvature in xi(s) differs at some locations, s, from the amount in β. The scaling factors ξw,h, ξm,h and ξn,h were drawn independently from Uniform(0, 0.1). Figure 1 exhibits a few examples of these xi.
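A sketch of this simulation follows. The Gaussian bump shape amp·exp(−curv·(s − h)²) is an assumption (the exact bump form in (13) is not reproduced here), and, following the rows of Table 2, the moderate set is taken to omit the wide and narrow locations 0.40, 0.50 and 0.90.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 400
s = np.linspace(0.0, 1.0, p)

# Bump locations with (amplitude, curvature), following Table 2
wide   = {0.50: (1.0, 500.0)}
narrow = {0.40: (0.25, 2500.0), 0.90: (0.25, 2500.0)}
moderate = {round(h, 2): (1.0, 1000.0) for h in np.arange(0.05, 1.0, 0.05)
            if round(h, 2) not in (0.40, 0.50, 0.90)}

def simulate_spectrum(noise_sd=0.01):
    """One spectrum x_i(s): random positive multiples of fixed bumps plus noise."""
    x = np.zeros(p)
    for bumps in (wide, moderate, narrow):
        for h, (amp, curv) in bumps.items():
            xi = rng.uniform(0.0, 0.1)                  # xi ~ Uniform(0, 0.1)
            x += xi * amp * np.exp(-curv * (s - h) ** 2)
    return x + rng.normal(scale=noise_sd, size=p)       # measurement noise

X = np.vstack([simulate_spectrum() for _ in range(100)])
```

The noise standard deviation is a placeholder; the paper's σ2 is instead calibrated to a target R2, as described below.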
Table 2:
The amplitude and curvature used to generate the predictors x and the regression vector β. Parameters defining features in equation (13) are specified in the columns Wide, Moderate and Narrow. Hβ represents the set of sampling points corresponding to bumps in β. Note that the degree of curvature in β may differ from that assumed in x. We considered two scenarios for the β curvature at s = 0.5: (i) C = 1000 (same curvature as x) and (ii) C = 500 (different curvature from x).
| h | Wide amp (aw,h) | Wide curv (cw,h) | Moderate amp (am,h) | Moderate curv (cm,h) | Narrow amp (an,h) | Narrow curv (cn,h) | Hβ amp (a0,h) | Hβ curv (c0,h) |
|---|---|---|---|---|---|---|---|---|
| 0.05 | | | 1 | 1000 | | | | |
| 0.10 | | | 1 | 1000 | | | | |
| 0.15 | | | 1 | 1000 | | | | |
| 0.20 | | | 1 | 1000 | | | 0.15 | 1000 |
| 0.25 | | | 1 | 1000 | | | | |
| 0.30 | | | 1 | 1000 | | | | |
| 0.35 | | | 1 | 1000 | | | 0.12 | 1000 |
| 0.40 | | | | | 0.25 | 2500 | | |
| 0.45 | | | 1 | 1000 | | | | |
| 0.50 | 1 | 500 | | | | | −0.10 | C |
| 0.55 | | | 1 | 1000 | | | | |
| 0.60 | | | 1 | 1000 | | | 0.08 | 1000 |
| 0.65 | | | 1 | 1000 | | | | |
| 0.70 | | | 1 | 1000 | | | 0.06 | 1000 |
| 0.75 | | | 1 | 1000 | | | | |
| 0.80 | | | 1 | 1000 | | | 0.06 | 1000 |
| 0.85 | | | 1 | 1000 | | | | |
| 0.90 | | | | | 0.25 | 2500 | | |
| 0.95 | | | 1 | 1000 | | | | |
amp: Amplitude; curv: Curvature
Figure 1:
n = 5 sample predictor curves x sampled at p = 400 points.
The following model was used to generate the outcome data for n independent samples:
yi = β0 + xi⊤β + εi, | (14) |
Also, β0 = 0.0 and εi ~ N(0, σ2), where σ2 was chosen so that R2, the ratio of the variance of the true responses (the noiseless Xβ) to the total variance of the observed y, equaled 0.8 or 0.9.
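The calibration of σ2 to a target R2 can be written directly: solving R2 = Var(Xβ)/(Var(Xβ) + σ2) for σ2 gives σ2 = Var(Xβ)(1 − R2)/R2. A sketch with stand-in X and β (not the simulated spectra):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 400
X = rng.normal(size=(n, p))                 # stand-in for the simulated spectra
beta = np.zeros(p)
beta[150:160] = 0.1                         # stand-in bump-shaped coefficient vector

signal = X @ beta                           # noiseless responses X beta
R2 = 0.9
# Solve R2 = var(signal) / (var(signal) + sigma2) for the error variance
sigma2 = signal.var() * (1.0 - R2) / R2
y = signal + rng.normal(scale=np.sqrt(sigma2), size=n)
```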
5.1.2. Model fitting
We apply both the SuPEER and PEER methods using the decomposition-based penalty LQ in (4). We define the discretized functions q1, … , qd spanning the favored subspace (used directly by PEER, and as a starting point for SuPEER), where each qj is defined to have a single bump at one of the d = 19 locations in Hmoderate. We compute PQ using (5); note that the rows of Q need not be orthogonal, just linearly independent. For the ridge penalty, L = Ip. The two-step procedure is defined as: (i) extract d = 19 “features” from each xi by regressing it onto the span of the vectors q1, … , qd; and (ii) regress the outcome y on the resulting set of d = 19 covariates.
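A sketch of the projection in (5) and of the two-step feature extraction follows; the Gaussian bump shape of the qj and the stand-in data are assumptions for illustration.

```python
import numpy as np

p = 400
s = np.linspace(0.0, 1.0, p)

# Favored-space basis: one moderate-width bump per q_j (Gaussian shape assumed)
H_mod = np.arange(0.05, 1.0, 0.05)                    # 19 locations
Q = np.vstack([np.exp(-1000.0 * (s - h) ** 2) for h in H_mod])   # d x p, d = 19

# Projection onto span(Q): rows need only be linearly independent, not orthogonal
P_Q = Q.T @ np.linalg.solve(Q @ Q.T, Q)

rng = np.random.default_rng(4)
n = 100
X = rng.normal(size=(n, p))                           # stand-in spectra
y = rng.normal(size=n)                                # stand-in outcomes

# Two-step procedure, step (i): features = LS coefficients of each x_i on the q_j
F = np.linalg.solve(Q @ Q.T, Q @ X.T).T               # n x 19 feature matrix
# step (ii): ordinary least squares of y on the extracted features (plus intercept)
coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), F]), y, rcond=None)
```

In step (ii) the paper then retains only the significant coefficients; a standard OLS t-test on `coef` would play that role.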
In the first simulation study, proper selection of the penalized space was of primary interest. We summarize the performance of SuPEER by presenting the percentage of correctly chosen qj vectors and compare it with the two-step procedure over 250 simulation runs.
In the second simulation study, our primary interest lies in estimation of the regression vector β. We summarize the estimation error using a scalar summary, namely integrated squared error (ISE). We report its median and interquartile range over 250 simulation runs.
The estimates of β from the SuPEER, PEER and ridge methods are obtained as best linear unbiased predictors (BLUPs) from the linear mixed model equivalent formulation of the penalized least squares problem (see Section 4.3).
5.1.3. Performance evaluation
In the first simulation study, we evaluated the fidelity of the chosen basis functions and compared it with the two-step estimation procedure. Table 3 summarizes the dimensionality of the Q space and the number of significant features chosen in the two-step procedure, while Figure 2 shows the relative frequencies of the features selected in one specific simulation setting of n = 100, p = 400 and R2 = 0.9. In all settings considered, the SuPEER method chooses the correct dimension of the Q space most of the time: the median is 6, with narrow quartile intervals ranging in most cases from 5 to 6 or from 6 to 7. The two-step procedure selects too many features in the first step, with a median between 12 and 14 and higher variability.
Table 3:
Median dimension of Qd* with first and third quartiles in parentheses as selected by the SuPEER method, and the number of significant covariates from the first stage of the two-step procedure, under various simulation scenarios: strengths of signal, R2, sampling points, p, and whether the curvature in the simulated β is the same as or different from that in x at s = 0.5.
| R2 | p | β | d* in SuPEER | # covariates in two-step |
|---|---|---|---|---|
| 0.9 | 400 | same | 6 (6, 7) | 13 (13, 14) |
| 0.9 | 200 | same | 6 (6, 6) | 13 (13, 14) |
| 0.8 | 400 | same | 6 (5, 6) | 12 (10, 12) |
| 0.8 | 200 | same | 6 (5, 6) | 12 (11, 13) |
| 0.9 | 400 | different | 6 (6, 7) | 14 (13, 15) |
| 0.9 | 200 | different | 6 (6, 7) | 14 (14, 15) |
| 0.8 | 400 | different | 6 (5, 6) | 12 (11, 13) |
| 0.8 | 200 | different | 6 (5, 7) | 12 (11, 13) |
Figure 2:
Selection frequency (out of 250 simulations) of the features in Q for SuPEER (black bars) versus the number of the significant factors selected in the two-step procedure (red bars) for the simulated data with n = 100, p = 400 and R2 = 0.9.
In the second simulation study, we compared estimation accuracy using the ISE. Table 4 summarizes the estimation errors. In all settings we considered, the methods that incorporate external information (PEER and SuPEER) outperform ordinary ridge regression by orders of magnitude.
Table 4:
Median integrated squared error, med ISE, with first and third quartiles for three methods in simulation with selected bump locations. The simulation scenarios include varied strengths of signal, R2, sampling points, p, and whether the curvature in the simulated β is the same as or different from that in x at s = 0.5. The sample size is n = 100 in each case.
| R2 | p | β | SuPEER med ISE | PEER med ISE | ridge med ISE |
|---|---|---|---|---|---|
| 0.9 | 400 | same | 0.020 (0.011, 0.038) | 0.061 (0.046, 0.082) | 1.852 (1.806, 1.898) |
| 0.9 | 200 | same | 0.018 (0.010, 0.032) | 0.061 (0.046, 0.075) | 1.359 (1.285, 1.441) |
| 0.8 | 400 | same | 0.068 (0.035, 0.177) | 0.123 (0.106, 0.159) | 1.940 (1.894, 1.997) |
| 0.8 | 200 | same | 0.063 (0.030, 0.175) | 0.120 (0.099, 0.159) | 1.526 (1.448, 1.609) |
| 0.9 | 400 | different | 0.052 (0.041, 0.072) | 0.068 (0.051, 0.083) | 1.955 (1.898, 2.009) |
| 0.9 | 200 | different | 0.050 (0.039, 0.074) | 0.069 (0.052, 0.086) | 1.441 (1.352, 1.528) |
| 0.8 | 400 | different | 0.186 (0.069, 0.252) | 0.135 (0.119, 0.173) | 2.058 (2.012, 2.114) |
| 0.8 | 200 | different | 0.169 (0.069, 0.253) | 0.133 (0.109, 0.175) | 1.613 (1.534, 1.705) |
In some settings PEER outperforms SuPEER (last two rows of Table 4). Here, the favored space for PEER is all of Qd, whereas the SuPEER favored space is reduced to Qd* (of dimension d* < d). If β is not contained in the original Qd, then the reduction to Qd* may result in an accumulation of error in some regions of the domain (those not represented by Qd*), and this error may exceed the gains obtained from SuPEER fitting β better than PEER in other regions. This is exhibited in the tradeoff between the pointwise bias and variance displayed in Figure 4.
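The pointwise decomposition underlying Figure 4 is the identity MSE(s) = bias²(s) + variance(s), computed across simulation replicates. A sketch with stand-in estimates (the constant bias and noise level are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
p, n_sim = 50, 250
beta = np.sin(np.linspace(0.0, np.pi, p))        # stand-in "true" coefficient vector

# Stand-in estimates across 250 simulated fits: true curve plus a constant
# bias of 0.1 and independent noise (sd 0.2) at each sampling point
est = beta + 0.1 + rng.normal(scale=0.2, size=(n_sim, p))

bias2 = (est.mean(axis=0) - beta) ** 2           # pointwise squared bias
var = est.var(axis=0)                            # pointwise variance
mse = ((est - beta) ** 2).mean(axis=0)           # pointwise mean squared error
```

Because `var` uses the same divisor as the mean (ddof = 0), the identity mse = bias2 + var holds exactly, replicate by replicate.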
Figure 4:
Top: a decomposition of pointwise MSE for SuPEER as: pointwise squared bias (black) and pointwise variance (red). Bottom: pointwise median ISE for SuPEER (black) and PEER (red), as summarized in Table 4.
5.2. Spectroscopy data example
We consider data from a cross-sectional study of n = 114 individuals chronically infected with HIV in which metabolite spectra were obtained from the basal ganglia. In this study, magnetic resonance spectroscopy (MRS) curves were sampled at p = 399 distinct frequencies and a marker of neurocognitive impairment, called the global deficit score (GDS) [6], was measured for each subject. A higher GDS indicates a greater degree of impairment. A detailed study description can be found in [12]. Here we are interested in finding metabolite markers of neurocognitive impairment by modelling the association of the metabolite spectra, x, with the global deficit score, y.
The MRS spectra are composed of pure metabolite spectra, instrument noise and a background profile. Figure 5 shows a sample of 10 subjects’ spectra. Spectral information from the following d = 9 pure metabolites was used to define an initial favored subspace for the SuPEER estimation procedure: Creatine (Cr), Glutamate (Glu), Glucose (Glc), Glycerophosphocholine (GPC), myo−Inositol (Ins), N-Acetylaspartate (NAA), N-Acetylaspartylglutamate (NAAG), scyllo−Inositol (Scyllo) and Taurine (Tau). All pure metabolite spectra are displayed in Figure 6. It is worth observing that there are several prominent features (peaks) in the pure metabolite spectra, as well as regions that contain mostly measurement error. We applied SuPEER to select from the favored space and to estimate a regression vector, β, relating the metabolite spectra to the outcome, GDS. In the final solution, d* = 3 pure metabolite spectra were selected for the Qd* space (see the top panel of Figure 7), and the estimated regression function β (bottom panel of Figure 7) corresponds, for the most part, to the selected vectors spanning the Q space. Finally, we note that the two-step procedure provided a solution with no significant predictors; i.e., it resulted in a null model.
Figure 5:
Sample spectra for 10 subjects.
Figure 6:
d = 9 pure metabolite spectra.
Figure 7:
Estimate of the regression function β (bottom panel) with the 3 pure metabolite spectra spanning the favored Qd* space (top panel).
6. Discussion
We have aimed to provide a bridge between two inverse problems: approximating a model vector in a calibration model, and estimating a coefficient vector in a statistical regression model. Our starting point was the mathematical equivalence between GTR (for calibration) and PEER (for regression). Although the mathematical formulations of these two methods are identical, their goals and motivations differ. In reviewing and connecting these concepts, we have aimed to make this framework more flexible by formulating an adaptively modified penalization process that requires a less rigid pre-specification of prior information. Our simulations and example illustrate this approach for estimation in a statistical regression model, and we expect that these concepts may also find application in approximating a model vector in a calibration model. Among the proposed tools is a method for selecting the tuning parameter, implemented by way of the linear mixed effects model framework; this may also prove useful in calibration settings.
The empirical results from our simulations suggest that our choices worked well. In particular, in the MRS example, the two-step method found no significant pre-selected features associated with GDS. There are, however, several potential limitations of our approach. These include its strong dependence on the definition of the initial favored space, Q. Although the decomposition-based penalty in (4) aims to alleviate a complete dependence, there remains the issue of how the two terms should be weighted. Additionally, although the adaptively refined favored space described in Section 4 also reduces the dependence on this choice, the stopping criterion we present here is not backed by theoretical guarantees. These and other theoretical aspects of the procedure require further investigation.
Figure 3:
The true regression vector (broken thick black line) with 10 sample estimates from: ridge (top), SuPEER (middle) and PEER (bottom).
7. Acknowledgements
Research support for TR, MK and JH was provided by the National Institutes of Health grant R01 MH108467 (Harezlak); TR was additionally supported by R01 GM114029 (Shojaie).
References
- [1] Andries E and Kalivas JH, Interrelationships between generalized Tikhonov regularization, generalized net analyte signal, and generalized least squares for desensitizing a multivariate calibration to interferences, Journal of Chemometrics 27 (2013), no. 5, 126–140.
- [2] Brumback BA and Rice JA, Smoothing spline models for the analysis of nested and crossed samples of curves, Journal of the American Statistical Association 93 (1998), no. 443, 961–976.
- [3] Bertero M and Boccacci P, Introduction to Inverse Problems in Imaging, Institute of Physics, Bristol, UK, 1998.
- [4] Björck A, Numerical Methods for Least Squares Problems, SIAM, Philadelphia, 1996.
- [5] Brown P, Measurement, Regression and Calibration, Oxford University Press, Oxford, UK, 1993.
- [6] Carey CL, Woods SP, Gonzalez R, Conover E, Marcotte TD, Grant I, and Heaton RK, Predictive validity of global deficit scores in detecting neuropsychological impairment in HIV infection, Journal of Clinical and Experimental Neuropsychology 26 (2004), no. 3, 307–319.
- [7] Choi E, Hall P, and Rousson V, Data sharpening methods for bias reduction in nonparametric regression, Annals of Statistics (2000), 1339–1355.
- [8] Lindley DV and Smith AFM, Bayes estimates for the linear model, Journal of the Royal Statistical Society, Series B (Methodological) 34 (1972), no. 1, 1–41.
- [9] Engl HW, Hanke M, and Neubauer A, Regularization of Inverse Problems, Kluwer, Dordrecht, Germany, 2000.
- [10] Golub GH, Heath M, and Wahba G, Generalized cross-validation as a method for choosing a good ridge parameter, Technometrics 21 (1979), no. 2, 215–223.
- [11] Hansen PC, Rank-Deficient and Discrete Ill-Posed Problems, SIAM, Philadelphia, PA, 1998.
- [12] Harezlak J, Buchthal S, Taylor M, Schifitto G, Zhong J, Daar E, Alger J, Singer E, Campbell T, Yiannoutsos C, et al., Persistence of HIV-associated cognitive impairment, inflammation, and neuronal injury in era of highly active antiretroviral treatment, AIDS 25 (2011), no. 5, 625–633.
- [13] Hastie T, Tibshirani R, and Friedman J, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag, New York, 2011.
- [14] Hoerl AE and Kennard RW, Ridge regression: biased estimation for nonorthogonal problems, Technometrics 12 (1970), no. 1, 55–67.
- [15] Kalivas JH, Overview of two-norm (L2) and one-norm (L1) Tikhonov regularization variants for full wavelength or sparse spectral multivariate calibration models or maintenance, Journal of Chemometrics 26 (2012), no. 6, 218–230.
- [16] Kalivas JH and Palmer J, Characterizing multivariate calibration tradeoffs (bias, variance, selectivity, and sensitivity) to select model tuning parameters, Journal of Chemometrics 28 (2014), no. 5, 347–357.
- [17] Lin X and Zhang D, Inference in generalized additive mixed models by using smoothing splines, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61 (1999), no. 2, 381–400.
- [18] Kutner M, Nachtsheim C, Neter J, and Li W, Applied Linear Statistical Models, 5th ed., McGraw-Hill/Irwin, 2004.
- [19] Muñoz Maldonado Y, Mixed models, posterior means and penalized least-squares, Lecture Notes–Monograph Series, vol. 57, pp. 216–236, Institute of Mathematical Statistics, Beachwood, Ohio, USA, 2009.
- [20] Provencher SW, Estimation of metabolite concentrations from localized in vivo proton NMR spectra, Magnetic Resonance in Medicine 30 (1993), no. 6, 672–679.
- [21] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2011, ISBN 3-900051-07-0.
- [22] Randolph TW, Harezlak J, and Feng Z, Structured penalties for functional linear models—partially empirical eigenvectors for regression, Electronic Journal of Statistics 6 (2012), 323–353.
- [23] Ruppert D, Wand MP, and Carroll RJ, Semiparametric Regression, Cambridge University Press, New York, 2003.
- [24] SAS Institute Inc., SAS/STAT Software, Version 9.2, Cary, NC, 2008.
- [25] Tikhonov AN and Arsenin VA, Solutions of Ill-Posed Problems, Winston and Sons, 1977.
- [26] Wahba G, Spline Models for Observational Data, vol. 59, Society for Industrial and Applied Mathematics, Philadelphia, 1990.
- [27] Wand MP and Jones MC, Kernel Smoothing, Chapman and Hall/CRC Press, 1994.