Abstract
This paper presents a technique based on the intuitively-simple concepts of Sample Domain and Effective Prediction Domain, for dealing with linear regression situations involving collinearity of any degree of severity. The Effective Prediction Domain (EPD) clarifies the concept of collinearity, and leads to conclusions that are quantitative and practically useful. The method allows for the presence of expansion terms among the regressors, and requires no changes when dealing with such situations.
Keywords: collinearity, efficient prediction domain, ill-conditioning, multicollinearity, regression analysis
Introduction
The scientists’ search for relations between measurable properties of materials or physical systems can be effectively helped by the statistical technique known as multiple regression. Even when limited to linear regression, the technique is often of great value, as we shall see below. Often, however, difficulties in interpretation arise because of a condition called collinearity. This condition, which is inherent in the structure of the design points (the X space) of the regression experiment, is often treated, at least implicitly, as a sort of disease of the data that is to be remedied by special mathematical manipulations of the data.
We consider collinearity not as a disease but rather as additional information provided by the data to the data analyst, warning him to limit the use of the regression equation as a prediction tool to specific subspaces of the X space, and telling him precisely what these subspaces are. Thus, collinearity is an indication of limitations inherent in the data. The statistician’s task is to detect these limitations and to express them in a useful manner. If this viewpoint is adopted, there is no need for remedial techniques. All that is required is a method for extracting the additional information from the data. We will present such a method.
The Model
We assume that measurements y have been made at a number of “x-points,” each point being characterized by the numerical values of a number of “regressor-variables” xj. We also assume that y is a linear function of the x-variables. The mathematical model, for p regressors, is:
y = β1x1 + β2x2 + ⋯ + βpxp + ϵ        (1)
where ϵ is the error in the y measurement. We denote by N the number of points, or “design points”, i.e., the combinations of the x’s at which y is measured.
Usually, the variable x1 is identically equal to “one” for all N points, to allow for the presence of a constant term. Then the expected value of y, denoted E(y), is equal to β1 when all the other x’s are zero. This point, called the origin, is seldom one of the design points and is, in fact, quite often far removed from all design points. In many cases this point is even devoid of physical meaning.
First Example:
Firefly Data
We present the problem in terms of two examples of real data. The first data set (Buck [1]) is shown in table 1. It consists of 17 points and has two regressors, in addition to a constant term (x1 = 1). The measurement is the time of the first flash of a firefly after 6:30 p.m. It is studied as a function of ambient light intensity (x2) and temperature (x3).
Table 1.
Data for firefly study.
| x1 | x2 | x3 | y |
|---|---|---|---|
| 1 | 26 | 21.1 | 45 |
| 1 | 35 | 23.9 | 40 |
| 1 | 40 | 17.8 | 58 |
| 1 | 41 | 22.0 | 50 |
| 1 | 45 | 22.3 | 31 |
| 1 | 55 | 23.3 | 52 |
| 1 | 55 | 20.5 | 54 |
| 1 | 56 | 25.5 | 38 |
| 1 | 70 | 21.7 | 40 |
| 1 | 75 | 26.7 | 28 |
| 1 | 79 | 25.0 | 38 |
| 1 | 87 | 24.4 | 36 |
| 1 | 100 | 22.3 | 36 |
| 1 | 100 | 25.5 | 46 |
| 1 | 110 | 26.7 | 40 |
| 1 | 130 | 25.5 | 31 |
| 1 | 140 | 26.7 | 40 |
Definition of Variables
y = time of first flash (number of minutes after 6:30 p.m.)
x2=light intensity (in metercandles, mc)
x3=temperature (°C)
Figure 1 is a plot of x3 versus x2. There is obviously a trend: x3 increases as x2 increases. The existence of a relation of this type between some of the regressor variables often causes difficulties in the interpretation of the regression analysis. To deal with the problem in a general way we propose a method based on two concepts. The first of these we shall call the “sample domain.”
Figure 1—
Sample domain.
For our data, the sample domain consists of the rectangle formed by the vertical straight lines going through the lowest and highest x2 of the experiment, respectively, and by the horizontal straight lines going through the lowest and highest x3, respectively (see fig. 1). The concept is readily generalized to an X space of any number of dimensions, where the sample domain becomes a rectangular box (hyperrectangle). Note that the vertex B of the sample domain is relatively far from any of the design points. This has important consequences.
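As a minimal numerical sketch (our own code and variable names, not part of the original analysis), the sample-domain limits are simply the componentwise minima and maxima of the regressors in table 1:

```python
# Sample domain for the firefly data (table 1): the smallest axis-aligned
# box that contains every design point.
x2 = [26, 35, 40, 41, 45, 55, 55, 56, 70, 75, 79, 87, 100, 100, 110, 130, 140]
x3 = [21.1, 23.9, 17.8, 22.0, 22.3, 23.3, 20.5, 25.5, 21.7,
      26.7, 25.0, 24.4, 22.3, 25.5, 26.7, 25.5, 26.7]

# Each coordinate contributes one pair of parallel boundary lines.
sample_domain = {"x2": (min(x2), max(x2)), "x3": (min(x3), max(x3))}
print(sample_domain)
```

The four corner combinations of these limits are the vertices labeled A through D in figure 1.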
The regression equation
ŷ = b1x1 + b2x2 + b3x3        (2)
allows us to estimate y at any point (x1, x2, x3) (we recall that x1 = 1) and to estimate the variance of ŷ at this point. The point can be inside or outside the sample domain. Obviously the variance of ŷ, which we denote by Var(ŷ), will tend to become larger as the point for which the prediction is made is moved further away from the cluster of points involved in the experiment. Therefore Var(ŷ) at the point B may be considerably larger than at points A, C, and D. Such a condition is associated with the concept of “collinearity.” We define collinearity, in a semi-quantitative way, as the condition that arises when, for at least one of the vertices of the sample domain, Var(ŷ) is considerably larger than for the other vertices. The concept will become clearer as we proceed.
At any rate, the larger variance at one of the vertices of the sample domain is generally the lesser of two concerns, the other being that the regression equation, for which validity may have been reasonably firmly established in the vicinity of the cluster of experimental points, may no longer be valid at a more distant point. It is important to note that the evidence from the data alone cannot justify inferences at such distant points. In order to validate prediction at such points, it is necessary to introduce either additional data or additional assumptions.
For these reasons, we seek to establish a region in the X-space for which prediction is reasonably safe on the basis of the experiment alone. We call this the Effective Prediction Domain, or EPD.
The EPD is the second concept required for our treatment of collinear data. It is closely related to the first concept, the sample domain, as will be shown below.
Establishing the EPD
Our procedure consists of two steps, involving two successive transformations of the coordinate system. The original coordinate system in which the x-regressors are expressed is referred to as the X-system.
1. The Z System
The first step consists of a translation of the X-system (parallel to itself) to a different origin, located centrally within the cluster of experimental points (centering), together with a rescaling of each x to a standard scale. The new system, called the Z-system, is given by the equations
zi1 = xi1/K        (3a)
zij = (xij − Cj)/Rj,  j > 1        (3b)
For Cj and Rj we consider two choices, which we call the Correlation Scale Transformation (CST) and the Range-Midrange Transformation (RMT). We discuss first the Correlation Scale Transformation, defined by the choice
Cj = x̄j = (Σi xij)/N,   Rj = √(Σi (xij − x̄j)²)        (4)
where i = 1 to N.
It easily follows from (3b) that
Σi zij² = 1,  j > 1        (5)
It is then reasonable to choose a value K in (3a) equal to
K = √N        (6)
so as to make Σi zi1² = 1 as well.
The values of Cj and Rj for the firefly data are given in table 2. Contrary to statements found in the literature (see discussion at end of this paper), the centering and rescaling defined by the Correlation Scale Transformation have no effect whatsoever on collinearity. The location of the sample domain relative to the design points remains unchanged, though it is expressed in different coordinates.
Table 2.
Firefly data—parameters for correlation scale transformation.
| j | C | R |
|---|---|---|
| 1 | 0 | 4.123106 |
| 2 | 73.176471 | 135.264447 |
| 3 | 23.582353 | 10.073962 |
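The CST parameters of table 2 can be reproduced in a few lines of Python (a sketch under our own naming, using NumPy):

```python
import numpy as np

# Correlation Scale Transformation (CST) for the firefly data: Cj is the
# mean of regressor j and Rj the root of its centered sum of squares
# (eq 4); the constant column x1 is divided by K = sqrt(N) (eq 6).
x2 = np.array([26, 35, 40, 41, 45, 55, 55, 56, 70, 75, 79, 87,
               100, 100, 110, 130, 140.0])
x3 = np.array([21.1, 23.9, 17.8, 22.0, 22.3, 23.3, 20.5, 25.5, 21.7,
               26.7, 25.0, 24.4, 22.3, 25.5, 26.7, 25.5, 26.7])
N = len(x2)

def cst(x):
    """Return (C, R, z) for one regressor under the CST."""
    C = x.mean()
    R = np.sqrt(((x - C) ** 2).sum())
    return C, R, (x - C) / R

K = np.sqrt(N)                     # so that sum(z1**2) = 1 as well
C2, R2, z2 = cst(x2)
C3, R3, z3 = cst(x3)
print(round(C2, 6), round(R2, 6))  # compare with table 2
```

Each rescaled column has unit sum of squares, which is what makes Z′Z a correlation-type matrix in the next step.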
To arrive at an EPD, a second operation is necessary, viz. a rotation of the Z-coordinate system to a new coordinate system, which we shall call the W-system (of coordinates).
2. The W-System
The rotation from Z to W is accomplished by the method of Principal Components, or its equivalent, the Singular Value Decomposition (SVD). For a discussion of this method the reader is referred to Mandel [2]. Here we merely recall a few facts. Each w-coordinate is a linear combination of all z-coordinates given by the matrix equation:
W = ZV′        (7)
where V is an orthogonal matrix.
In algebraic notation, eq (7) becomes
wk = Σj vkj zj        (8)
where the vkj are the elements of the V matrix. The vkj, for a given k, are simply the direction cosines of the wk axis with respect to the Z-system. Consequently,
Σj vkj² = 1        (9)
Since the rotation is orthogonal, any two distinct w-axes, say wk and wk′, are orthogonal and consequently:
Σj vkj vk′j = 0,  k ≠ k′        (10)
For the firefly data, the V matrix is shown in table 3, and the complete set of z and w coordinates is given in table 4.
Table 3.
Firefly data—V matrix.
| k \ j | 1 | 2 | 3 |
|---|---|---|---|
| 1 | 0 | .7071 | .7071 |
| 2 | 1.000 | 0 | 0 |
| 3 | 0 | −.7071 | .7071 |
Table 4.
Firefly data—z and w coordinates (CST).1
| Point | z2 | z3 | w1 | w3 |
|---|---|---|---|---|
| 1 | −.3488 | −.2464 | −.4216 | .0724 |
| 2 | −.2822 | .0315 | −.1780 | .2219 |
| 3 | −.2453 | −.5740 | −.5800 | −.2324 |
| 4 | −.2379 | −.1571 | −.2800 | .0572 |
| 5 | −.2083 | −.1273 | −.2381 | .0573 |
| 6 | −.1344 | −.0280 | −.1156 | .0753 |
| 7 | −.1344 | −.3060 | −.3121 | −.1213 |
| 8 | −.1270 | .1904 | .0440 | .2245 |
| 9 | −.0235 | −.1869 | −.1495 | −.1155 |
| 10 | .0135 | .3095 | .2276 | .2094 |
| 11 | .0431 | .1407 | .1292 | .0691 |
| 12 | .1022 | .0812 | .1289 | −.0148 |
| 13 | .1983 | −.1273 | .0495 | −.2302 |
| 14 | .1983 | .1904 | .2741 | −.0055 |
| 15 | .2722 | .3095 | .4106 | .0264 |
| 16 | .4201 | .1904 | .4309 | −.1624 |
| 17 | .4940 | .3095 | .5674 | −.1304 |
| λ1 = 1.6549 | λ3 = .3451 |
zi1 = 1/√17 = .2425 for all i.
wi2 = .2425 for all i; λ2 = 1.0000.
Note that row 2, as well as column 1, in table 3 consists of the element “one” in one cell and zeros in all other cells. This is a consequence of the orthogonality of z1 with respect to all zj with j > 1. This orthogonality is in turn due to the nature of the Correlation Scale Transformation, as expressed by eq (4).
At the bottom of the w columns we find values labeled 𝜆j. They are simply the sums of squares of all w-values in that column.
λj = Σi wij²        (11)
The λj are also the eigenvalues of the Z′Z matrix which, for our choice of Cj and Rj, is the correlation matrix of the regressors x. Note that w2 is the constant coordinate z1. Consequently wi2 = 1/√N for all i, and λ2 = 1.
We need to consider w1 and w3 only. A similar situation applies to the z coordinates, where zi1 = 1/√N for all i. Figure 2 shows both the z-coordinates (z2 and z3) and the w-coordinates (w1 and w3) for the firefly data. The order of the w-coordinates (w1, w2, w3) is that of the corresponding λ-values, in decreasing order.
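The rotation just described is easy to reproduce numerically. The sketch below (our code, not the paper's) obtains the λ-values of table 4 as the eigenvalues of Z′Z for the firefly data:

```python
import numpy as np

# Rotation of the Z-system to the W-system for the firefly data (CST).
# Z holds the rescaled constant column z1 = 1/sqrt(N) and the centered,
# scaled columns z2, z3; the lambda_j are the eigenvalues of Z'Z.
x2 = np.array([26, 35, 40, 41, 45, 55, 55, 56, 70, 75, 79, 87,
               100, 100, 110, 130, 140.0])
x3 = np.array([21.1, 23.9, 17.8, 22.0, 22.3, 23.3, 20.5, 25.5, 21.7,
               26.7, 25.0, 24.4, 22.3, 25.5, 26.7, 25.5, 26.7])
N = len(x2)

def cst(x):
    return (x - x.mean()) / np.sqrt(((x - x.mean()) ** 2).sum())

Z = np.column_stack([np.full(N, 1 / np.sqrt(N)), cst(x2), cst(x3)])

lam, V = np.linalg.eigh(Z.T @ Z)   # eigh returns eigenvalues in ascending order
lam = lam[::-1]                    # decreasing order, as in table 4
W = Z @ V[:, ::-1]                 # w-coordinates of the 17 design points
print(np.round(lam, 4))
```

The sums of squares of the W columns reproduce the λ-values (eq 11), and one eigenvalue is exactly unity because z1 is orthogonal to the centered columns.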
Figure 2—
EPD for firefly data.
3. The Effective Prediction Domain (EPD)
The EPD is simply the sample domain corresponding to the W-system of coordinates. Thus, straight lines parallel to the w3-axis are drawn through the smallest and largest w1, respectively, and lines parallel to the w1-axis are drawn through the smallest and largest w3. Here again generalization is readily made to a p -dimensional W -space. The EPD for the firefly data is also shown in figure 2.
The interpretation of the EPD is straightforward. Unlike the sample domain in either the X-system or the Z-system, the EPD excludes points that are distant from the cluster of regressor points. This has two advantages. In the first place, the use of the regression equation is justified for all points inside, and on the periphery of, the EPD. Secondly, the variance of the predicted value for any such point will not be unduly large. These statements require more detailed treatment. To this effect we introduce the concept of the variance factor (VF).
4. The Variance Factor (VF)
From regression theory we know that the variance of any linear function, say L, of the coefficient estimates is of the form:
Var(L) = f(X)·σ²        (12)
where σ² is the variance of the experimental errors ϵ of the y measurements. The multiplier f(X) is independent of the y and depends only on the X matrix and on the coefficients in the L function. We call this multiplier the variance factor, VF.
Thus, we have:
Var(bj) = VF(bj)·σ²        (13)
and
Var(ŷ) = VF(ŷ)·σ²        (14)
In eq (14), ŷ is the estimated, or predicted, y value at any chosen point in X-space. VF(ŷ) is of course a function of the location of this point.
Returning now to our statements above, it is well-known that a regression equation can show excellent (very small) residuals and yet be very poor for certain prediction purposes. The small residuals merely mean that a good fit has been obtained at the points used in the experiment. This is no guarantee that the fit is good at other points. However, if the regression equation is scientifically reasonable, it is likely that the experimental situation underlying it will also be valid for points that are close to the cluster of the regressor points used in the experiment. Every point in the EPD satisfies this requirement.
Furthermore, the variance of prediction, measured by the VF, will also be reasonably small for all points of the EPD, simply because they are geometrically close to the design points.
The calculation of VF(ŷ) is quite simple, once the V-matrix and the λ values have been calculated. It is based on the equation
VF(ŷ) = Σk uk²/λk        (15)
where uk is defined as:
uk = Σj vkj zj        (16)
Combining eqs (8) and (16), we obtain
uk = wk        (17)
and hence:
VF(ŷ) = Σk wk²/λk        (18)
Figure 3 shows the VF values at the vertices of the original sample domain and of the EPD. Interpreting these results, we see that the collinearity of our data is reflected in the rejection of an appreciable portion of the sample domain for purposes of safe prediction. This does not mean that prediction outside the EPD is impossible, or unacceptable. It merely means that such prediction cannot be justified on the basis of the data alone. Of course, the risk of predicting outside the EPD increases with the distance from the EPD. It will generally be reasonably safe to use the regression equation even outside the EPD, as long as the point for which prediction is made is reasonably close to the borders of the EPD. Using eq (18), the VF for any contemplated prediction point is readily calculated and can serve as a basis for decision.
Figure 3—
VF at vertices of sample domain and of EPD.
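Equation (18) is easy to verify numerically. The sketch below (our code, with a random stand-in design matrix rather than one of the paper's data sets) checks that the eigenvalue form of VF agrees with the direct textbook expression z′(Z′Z)⁻¹z:

```python
import numpy as np

# Numerical check of eq (18): VF(y-hat) computed from eigenvalues and
# w-coordinates equals z'(Z'Z)^{-1} z computed directly.
rng = np.random.default_rng(0)
Z = rng.normal(size=(20, 3))           # stand-in design matrix (hypothetical)

lam, V = np.linalg.eigh(Z.T @ Z)       # columns of V are eigenvectors
z_point = np.array([1.0, 0.3, -0.2])   # an arbitrary prediction point

u = V.T @ z_point                      # eq (16): u_k = sum_j v_kj z_j
vf_eig = np.sum(u ** 2 / lam)          # eq (15) / eq (18)
vf_direct = z_point @ np.linalg.inv(Z.T @ Z) @ z_point
print(vf_eig, vf_direct)
```

The two computations agree because (Z′Z)⁻¹ = VΛ⁻¹V′, so the quadratic form decomposes exactly into the Σ uk²/λk of eq (15).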
Second Example:
Calibration for Protein Determination
The instructive and intuitively satisfying graphical display of the EPD becomes impossible when the number of regressors, including the independent term, exceeds 3. We must then replace the graphical procedure by an analytical one, as will now be shown in the treatment of our second example.
The data were presented by Fearn [3], in a discussion of Ridge Regression. They represent the linear regression of percent protein, in ground wheat samples, on near-infrared reflectance at six different wavelengths.
For reasons of simplicity in presentation, we include here only three of the six wavelengths, a change that has a rather small effect on the final outcome of the analysis: it turns out that the regression equation based on these 3 wavelengths is very nearly as precise as that based on 6 wavelengths.
The data, displayed in table 5, are a very good example of the use of regression equations: the regression equation is indeed to be used as a “calibration curve” for the analysis of protein, using rapid spectrometry instead of the far more time-consuming Kjeldahl nitrogen determination. Our data have an N value of 24, and p (including the independent term) is 4.
Table 5.
Protein Calibration Data(*)
| Reflectance | % Protein | |||
|---|---|---|---|---|
| Point | x2 | x3 | x4 | y |
| 1 | 246 | 374 | 386 | 9.23 |
| 2 | 236 | 386 | 383 | 8.01 |
| 3 | 240 | 359 | 353 | 10.95 |
| 4 | 236 | 352 | 340 | 11.67 |
| 5 | 243 | 366 | 371 | 10.41 |
| 6 | 273 | 404 | 433 | 9.51 |
| 7 | 242 | 370 | 377 | 8.67 |
| 8 | 238 | 370 | 353 | 7.75 |
| 9 | 258 | 393 | 377 | 8.05 |
| 10 | 264 | 384 | 398 | 11.39 |
| 11 | 243 | 367 | 378 | 9.95 |
| 12 | 233 | 365 | 365 | 8.25 |
| 13 | 288 | 415 | 443 | 10.57 |
| 14 | 293 | 421 | 450 | 10.23 |
| 15 | 324 | 448 | 467 | 11.87 |
| 16 | 271 | 407 | 451 | 8.09 |
| 17 | 360 | 484 | 524 | 12.55 |
| 18 | 274 | 406 | 407 | 8.38 |
| 19 | 260 | 385 | 374 | 9.64 |
| 20 | 269 | 389 | 391 | 11.35 |
| 21 | 242 | 366 | 353 | 9.70 |
| 22 | 285 | 410 | 445 | 10.75 |
| 23 | 255 | 376 | 383 | 10.75 |
| 24 | 276 | 396 | 404 | 11.47 |
x1 = 1
Table 6 exhibits the correlation matrix of the 24 design points. It is very apparent that the x values at all three wavelengths are highly correlated with each other, thus indicating a high degree of collinearity. At first glance one would be very skeptical about such a set of data, and suspect that the X matrix shows such a high degree of redundancy as to make the regression useless for prediction purposes. Fearn explains that the correlations are more a reflection of particle size variability than of protein content. Our analysis will confirm that, properly interpreted, the data lead to a very satisfactory calibration procedure.
Table 6.
Protein calibration data—correlation matrix of x1 through x4.
|  | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| x1 | 1 | 0 | 0 | 0 |
| x2 |  | 1 | .9843 | .9337 |
| x3 |  |  | 1 | .9545 |
| x4 |  |  |  | 1 |
We will find it useful to introduce a slightly different Z transformation, which we call the Range-Midrange Transformation.
The Range-Midrange Transformation
The Range-Midrange Transformation (RMT) is defined as follows:
zi1 = xi1 = 1        (19a)
zij = (xij − Cj)/Rj,  j > 1        (19b)
but now Cj is defined as the midrange of the N values of xj and Rj is one-half the range of these values. With these definitions, it is clear that the smallest z-value, for any regressor, is (−1) and the largest z-value is (+1). It is because of this −1 to +1 scale that this transformation was introduced. The benefits of this scale will become apparent in the following section.
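A minimal sketch of the RMT (our code and naming), checked against the x2 column of table 5 and the parameters of table 7:

```python
import numpy as np

# Range-Midrange Transformation, eqs (19a)-(19b): Cj is the midrange and
# Rj half the range, so every transformed value lies in [-1, +1].
x2 = np.array([246, 236, 240, 236, 243, 273, 242, 238, 258, 264, 243, 233,
               288, 293, 324, 271, 360, 274, 260, 269, 242, 285, 255, 276.0])

def rmt(x):
    C = (x.min() + x.max()) / 2    # midrange
    R = (x.max() - x.min()) / 2    # half-range
    return C, R, (x - C) / R

C2, R2, z2 = rmt(x2)
print(C2, R2)                      # compare with table 7
```

By construction the extreme design points map exactly onto −1 and +1, which is the property exploited in the inequalities of the next section.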
EPD for the Protein Data
The EPD resulting from the Singular Value Decomposition based on the Range-Midrange Transformation will not be the same as the EPD we would have obtained using the Correlation Scale Transformation, but we will see that those features of the EPD that are important for establishing the limitations of the regression equation are practically unaffected.
Table 7 shows the C and R values for the four regressors and table 8 exhibits the V matrix and the λ values obtained from the Singular Value Decomposition. The latter, it may be recalled, simply expresses the rotation of the Z coordinate system to the W system.
Table 7.
Protein calibration data—parameters for Z transformation (RMT).
| j | C | R |
|---|---|---|
| 1 | 0 | 1 |
| 2 | 296.5 | 63.5 |
| 3 | 418.0 | 66.0 |
| 4 | 432.0 | 92.0 |
Table 8.
Protein calibration data—V matrix and λ values (RMT).
| k \ j | 1 | 2 | 3 | 4 | λ |
|---|---|---|---|---|---|
| 1 | −.6665 | .4845 | .4217 | .3784 | 43.7810 |
| 2 | .7365 | .3299 | .3797 | .4523 | 8.3782 |
| 3 | −.1096 | −.5491 | −.2509 | .7896 | .3758 |
| 4 | −.0332 | −.5958 | .7843 | −.1698 | .06624 |
For each wk coordinate, there are 24 values, corresponding to the 24 regressor points.
Table 9 shows the smallest and the largest wk value, for each of the four k.
Table 9.
Protein calibration data—limits defining the EPD.
| Coordinate (k) | Smallest w | Largest w |
|---|---|---|
| 1 | −1.9282 | .6181 |
| 2 | −.4097 | 1.8989 |
| 3 | −.1669 | .3158 |
| 4 | −.0801 | .1324 |
According to table 9, we must have, in the EPD:
−1.9282 ≤ w1 ≤ .6181        (20)
with similar statements for w2, w3, and w4. Applying now eq (8), this double inequality can be written:

−1.9282 ≤ −.6665 z1 + .4845 z2 + .4217 z3 + .3784 z4 ≤ .6181

Since z1 is constant and equal to 1, this double inequality becomes:

−1.2617 ≤ .4845 z2 + .4217 z3 + .3784 z4 ≤ 1.2846        (21a)
With the RMT, the value of any zk is, for any k > 1, between ( −1) and (+1). Thus the expression in the middle has, for all design points, a value between −1.2846 and 1.2846, where 1.2846 is the sum of the absolute values of the three coefficients. Therefore, the double inequality expressed by eq (21a) holds, essentially, for every point in the original sample domain. Thus, w1, the first coordinate of the EPD, which represents its largest dimension, imposes essentially no restrictions on the sample domain.
Doing the same calculations for the three other w-coordinates (see table 9), we obtain, respectively:
−1.1462 ≤ .3299 z2 + .3797 z3 + .4523 z4 ≤ 1.1624        (21b)
−.0573 ≤ −.5491 z2 − .2509 z3 + .7896 z4 ≤ .4254        (21c)
−.0469 ≤ −.5958 z2 + .7843 z3 − .1698 z4 ≤ .1656        (21d)
We see that w2, too, imposes only very light restrictions on the sample domain. On the other hand, w3 and w4 do imply limitations that eliminate appreciable portions of the sample domain from the EPD.
We could readily convert eqs (21c) and (21d) to x coordinates by means of table 7 and eqs (19a) and (19b), but the z-coordinates, using the Range-Midrange Transformation, are more readily interpreted in terms of the severity of collinearity than the x-coordinates.
Thus, the sums of the absolute values of the coefficients in the middle terms of (21c) and (21d) are 1.5896 and 1.5499, respectively. Points for which these linear combinations take the values ±1.5896 and ±1.5499 exist in the original sample domain. The EPD, on the other hand, limits these functions to much narrower intervals.
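The passage from the table 9 limits to these double inequalities can be scripted. The following sketch (our code) reproduces the bounds of eq (21a) from the first row of the V matrix in table 8:

```python
import numpy as np

# From an EPD limit on w1 (table 9) to a double inequality in the z's:
# substitute w1 = sum_j v_1j z_j (eq 8) and absorb the constant z1-term
# (z1 = 1 under the RMT) into the two bounds.
v1 = np.array([-.6665, .4845, .4217, .3784])   # first row of V, table 8
w1_min, w1_max = -1.9282, .6181                # limits from table 9

lo = w1_min - v1[0]    # lower bound on .4845 z2 + .4217 z3 + .3784 z4
hi = w1_max - v1[0]    # upper bound

# Since every z lies in [-1, +1], the middle expression can never exceed
# the sum of the absolute coefficients:
reach = np.abs(v1[1:]).sum()
print(round(lo, 4), round(hi, 4), round(reach, 4))
```

Because the attainable reach of the middle expression never exceeds the upper bound, the w1 constraint excludes essentially nothing from the sample domain, exactly as stated in the text.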
Effect of Type of Z Transformation
We have used two different Z transformations, the Correlation Scale and the Range-Midrange. It is proper to ask how our results for the Protein Calibration Data would have been affected had we used the Correlation Scale instead of the Range-Midrange Transformation. We show the comparison in table 10. Let us recall that with the CST, one of the w coordinates yields a λ-value of unity, and a constant w value for all points. Therefore we obtain, for the CST, only three sets of inequalities, as compared to the four sets for the RMT. To allow the comparison between the two transformations to be made, we have multiplied eqs (21a) through (21d) by positive constants, so as to make the coefficient of z4 equal to ±1. The same was done for the corresponding inequalities obtained by the Correlation Scale Transformation.
Table 10.
Protein calibration data—effect of Z transformation.1
| w coordinate | z Transf. | Inequalities |
|---|---|---|
| 1 | CST | −3.034≤1.021 z2+1.061 z3+z4≤3.082 |
| RMT | −3.334≤1.280 z2+1.114 z3+z4≤3.395 | |
| 2 | CST | |
| RMT | −2.534≤.729 z2 + .840 z3+z4≤2.569 | |
| 3 | CST | −.075≤−.686 z2−.321 z3+z4≤.535 |
| RMT | −.072≤ −.695 z2−.318 z3+z4≤.539 | |
| 4 | CST | −.278≤−3.531 z2+4.640 z3−z4≤.980 |
| RMT | −.276≤−3.509 z2+4.619 z3−z4≤.975 |
All inequalities are expressed in RMT z coordinates.
Of course, since the z coordinates are different for the two transformations, the inequalities for the CST, expressed in the CST z-units, had to be converted to RMT z-units, for a meaningful comparison. As can be seen from table 10, the two smallest dimensions of the EPD are practically the same for the two transformations. Thus, even though the method of principal components is not invariant with respect to linear transformations of scale, our analysis leads, in this case, to very similar results for the small dimensions of the EPD. We believe that this is generally true for all situations in which collinearity is noticeable, i.e., for all situations in which the EPD eliminates considerable portions of the original sample domain. For situations in which this does not apply, i.e., totally non-collinear cases, the inequalities do not matter, since they impose no restrictions on the sample domain.
It is interesting to contrast the remarkable similarity between the inequalities for w3 and w4 for the two transformations in table 10 with the behavior of a commonly advocated measure of collinearity (Belsley, Kuh, and Welsch [4]), the condition number.
The SVD resulting from the CST yields the following eigenvalues: 2.9151, 1.0000, .07176, .01312. The condition number is defined as the ratio of the largest to the smallest eigenvalue. In this case:

c = 2.9151/.01312 = 222.2
On the other hand, the SVD resulting from the RMT on the same data yields the eigenvalues: 43.7810, 8.3782, .37575, .066244. This time we have:

c = 43.7810/.066244 = 660.9
Thus the condition number varies considerably when the data are subjected to different standardizing transformations. It is not clear what useful information can be derived from the condition number.
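The comparison takes only a few lines (our code; the eigenvalues are those quoted above):

```python
# Condition numbers (largest / smallest eigenvalue) for the protein data
# under the two standardizing transformations.
eig_cst = [2.9151, 1.0000, .07176, .01312]
eig_rmt = [43.7810, 8.3782, .37575, .066244]

cond_cst = max(eig_cst) / min(eig_cst)
cond_rmt = max(eig_rmt) / min(eig_rmt)
print(round(cond_cst, 1), round(cond_rmt, 1))
```

The same data yield condition numbers differing by a factor of about three, which is the instability criticized in the text.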
By contrast, the treatment of collinearity we advocate has a useful and readily understood interpretation: the EPD is that part of the X space in which, and near which, prediction is safe. It also indicates what portions of the original sample domain are inappropriate for prediction on the basis of the given data alone. It fulfills this function in a way which is practically invariant with respect to intermediate transformations of scale. We use the qualifier “intermediate” because collinearity has meaning only in terms of a given original coordinate system (the X system). This system, which determines the original sample domain, must be considered fixed. On the other hand, transformations of this system prior to calculating the EPD can be defined in different ways without affecting the practical inferences drawn from the data on the basis of the final EPD derived from the standardizing transformation.
Cross-Validation
We can take advantage of the availability of a second set of protein calibration data, also given in Fearn [3], to verify the correctness of our approach. Fearn lists 26 additional points for which the reflectance measurements, as well as the Kjeldahl nitrogen determination, were made. We applied the Z transformation obtained above (RMT on the first set of 24 points) to each of these 26 points, and noted every point for which at least one of the four sets of inequalities (21a) through (21d) failed to be satisfied. We found 14 such points. This means that 14 “future points” obtained under the same test conditions were outside the EPD established on the basis of the original 24 points. However, as we observed above, as long as a point is not far from the EPD, prediction at that point is likely to be valid. We tested “predictability” at these 14 points by calculating the VF value for each of them, and by comparing the predicted protein value with the measured one. The results are shown in table 11. It is apparent that all VF are relatively small, indicating that even though these 14 points are outside the EPD calculated from the original set, they are not far from that EPD. This is confirmed by the good agreement between the observed and predicted values. The standard deviation of fit for the original set of 24 points was 0.23; the standard deviation for a single measurement derived from the 14 differences in table 11 is 0.30.
Table 11.
Protein calibration data—cross-validation of analysis.
| % Protein | |||
|---|---|---|---|
| Point1 | Observed | Predicted | VF |
| 1 | 8.66 | 9.53 | .281 |
| 4 | 11.77 | 11.97 | .416 |
| 6 | 10.46 | 10.96 | .193 |
| 9 | 12.03 | 11.47 | .212 |
| 10 | 9.43 | 9.54 | .762 |
| 11 | 8.66 | 8.15 | .454 |
| 12 | 14.44 | 13.99 | .881 |
| 14 | 10.41 | 10.17 | .468 |
| 16 | 11.69 | 11.24 | .472 |
| 17 | 12.19 | 11.83 | .390 |
| 18 | 11.59 | 11.39 | .314 |
| 20 | 8.60 | 8.39 | .201 |
| 22 | 9.34 | 8.93 | .151 |
| 26 | 10.89 | 10.94 | .741 |
Point in additional set (Fearn [3]) with its number designation in that set.
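The screening step used above can be sketched as follows (our code; V and the limits are taken from tables 8 and 9, and the two test points are illustrative, not Fearn's):

```python
import numpy as np

# Screening a "future" point against the EPD: rotate its z-coordinates
# into the W-system (eq 8) and test every w_k against the table 9 limits.
V = np.array([[-.6665,  .4845,  .4217,  .3784],
              [ .7365,  .3299,  .3797,  .4523],
              [-.1096, -.5491, -.2509,  .7896],
              [-.0332, -.5958,  .7843, -.1698]])   # table 8, rows = k
w_min = np.array([-1.9282, -.4097, -.1669, -.0801])
w_max = np.array([  .6181, 1.8989,  .3158,  .1324])

def in_epd(z):
    w = V @ np.asarray(z, dtype=float)   # w_k = sum_j v_kj z_j
    return bool(np.all((w_min <= w) & (w <= w_max)))

print(in_epd([1, 0, 0, 0]))    # the center of the sample domain
print(in_epd([1, 1, -1, 1]))   # a remote vertex of the sample domain
```

The center of the sample domain passes all four tests, while the remote vertex violates the w3 constraint, illustrating how the EPD trims the corners of the sample domain.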
Expansion Terms
Quite frequently, a regression equation contains x variables that are non-linear functions of one or more of the other x variables, such as x2², x2⋅x3, etc. Polynomial regressions are necessarily of this type. Since the x variables are non-stochastic in the usual regression models, the least squares solution for the regression equation is not affected by the presence of such “expansion terms.” On the other hand, collinearity can be introduced, or removed, or modified by them.
In our treatment the expansion terms cause no additional problems. Consider for example, the regression
y = β1x1 + β2x2 + β3x3 + ϵ,  where x3 = x2²        (22)
with x1 ≡ 1.
Here we have p = 3. Using RMT, followed by a singular value decomposition, we obtain an EPD of three dimensions, leading to the inequalities

wk,min ≤ wk ≤ wk,max,  k = 1, 2, 3        (23)
Expressing the w as functions of the z, this leads to three double inequalities governing the z, of the form

wk,min ≤ fk(z) ≤ wk,max

where fk(z) denotes the linear combination Σj vkj zj.
Now, since x3 = x2², we have

C3 + R3z3 = (C2 + R2z2)²

Hence:

z3 = .4592 z2² + z2 − .4592        (24)
Because of this relation the functions f1(z), f2(z), f3(z) become functions of z1, z2 (and z2²) only. Using this fact, we interpret the three sets of inequalities (23) exactly as we interpreted eqs (21a) through (21d): by determining which of these inequalities, if any, impose restrictions on the use of the original sample domain.
To illustrate this procedure, consider the small set of artificial data shown in table 12, for which the model is given at the bottom of the table. The term x3 = x2² introduces a high correlation between x2 and x3 and consequently also considerable collinearity.
Table 12.
An artificial quadratic example1.
| Point | x2 | x3 | y |
|---|---|---|---|
| 1 | .2 | .04 | 28.3 |
| 2 | .4 | .16 | 27.5 |
| 3 | 1 | 1.00 | 25.6 |
| 4 | 2.1 | 4.41 | 28.7 |
| 5 | 3.6 | 12.96 | 46.4 |
| 6 | 4.7 | 22.09 | 69.8 |
y = β1x1 + β2x2 + β3x3 + ϵ; β1 = 30, β2 = −8, β3 = 3.5, σe = 0.2, x1 = 1.
Note that x3 = x2².
The inequalities characterizing the EPD, based on a Range-Midrange Transformation and converted to the z-scales, are shown in table 13. Applying eq (24) to express z3 in terms of z2 turns each of the three double inequalities into a double inequality in z2 alone.
Table 13.
Quadratic example—inequalities for EPD.
| w-coordinate | Inequalities |
|---|---|
| w1 | −1.1281≤−.5070 z2+.6214 z3≤1.1284 |
| w2 | −.8541≤.5029 z2+.3511 z3≤.8541 |
| w3 | −.3141≤−.7005 z2+.7008 z3≤.0003 |
It is readily verified that of these six inequalities, all but one are satisfied for all z2 values between −1 and +1. The last one, involving the left side of the third set, is satisfied for all z2 values except for the interval −.156 ≤ z2 ≤ .155. This corresponds to an x2 interval between 2.1 and 2.8, i.e., between the design points x2 = 2.1 and x2 = 3.6 (see table 12). The interpretation of this finding is that while all design points are of course inside the EPD, a small portion of the curve of x3 = x2² versus x2 falls slightly outside the EPD. This is of no practical significance, since the VF for these points, even though they are outside the EPD, does not exceed 0.58. By comparison, the smallest VF value along the curve, for the range x2 = .2 to x2 = 4.7, is of the order of 0.26. Thus we see that the serious collinearity in this data set is merely a consequence of the presence of the expansion term x2².
Any point in X space, in order to be acceptable, must lie on the curve x3 = x2². An x3 with any other value is obviously not valid, and our analysis of the data, through the EPD, calls attention to this fact: in the direction of w3, the width of the EPD is only .31, as compared with widths of 2.26 and 1.71 for w1 and w2.
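The coefficients of eq (24) follow directly from the RMT parameters of table 12, as this sketch (our code) verifies:

```python
import numpy as np

# The constraint x3 = x2**2 carried into the RMT z-scales:
# C3 + R3*z3 = (C2 + R2*z2)**2, whose expansion gives eq (24).
x2 = np.array([.2, .4, 1.0, 2.1, 3.6, 4.7])   # table 12
x3 = x2 ** 2

C2, R2 = (x2.min() + x2.max()) / 2, (x2.max() - x2.min()) / 2
C3, R3 = (x3.min() + x3.max()) / 2, (x3.max() - x3.min()) / 2
z2, z3 = (x2 - C2) / R2, (x3 - C3) / R3

a = R2 ** 2 / R3          # coefficient of z2**2
b = 2 * C2 * R2 / R3      # coefficient of z2
c = (C2 ** 2 - C3) / R3   # constant term
print(round(a, 4), round(b, 4), round(c, 4))
```

Substituting the design points back in confirms that z3 = a z2² + b z2 + c holds exactly, which is why the EPD analysis needs no special machinery for expansion terms.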
Discussion
The common mathematical definition of collinearity is the existence of at least one linear relation between the x’s, of the form
c1xi1 + c2xi2 + ⋯ + cpxip = 0        (25)
where the cj are not all zero, and such that eq (25) holds, with the same cj values, for all i. This defines what we shall call “exact collinearity.” Geometrically, it means that all design points lie in a hyperplane of the x-space going through the origin of the coordinate system. Equation (25) also implies that the matrix X′X is singular, and consequently that the estimates of the β coefficients are not uniquely defined.
Exact collinearity seldom occurs in real experimental situations; indeed, if the X matrix is not the result of a designed experiment, it is highly improbable that a relation such as eq (25) would hold exactly. If, on the other hand, the experiment is designed, care would generally have been taken to avoid a situation of exact collinearity.
While exact collinearity is practically of little concern, near-collinearity is a frequent occurrence in real-life data. This occurs when an equation such as (25) is “approximately” true for all i. Many attempts have been made to define more closely the concept of near-collinearity, but while these endeavors have led to a number of proposals for measuring collinearity, they are of little practical use to the experimenter confronted with the task of interpreting his data.
It is not our intention to discuss here the pros and cons of the various attempts made by a number of authors to “remedy” a near-collinear situation. The best-known of these remedial procedures is Ridge Regression. We merely repeat what we have said in the body of the paper: any attempt to remedy collinearity must necessarily be based on additional assumptions, unless it consists of making additional measurements. The latter alternative is of course logical and valid, but the making of assumptions invented specifically for the purpose of removing collinearity does not appear to us to be a recommendable policy in data analysis.
One easily recognizable condition leading to collinearity is the existence of at least one high correlation coefficient among the non-diagonal elements of the correlation matrix of the x’s. This has given rise to the concept of the Variance Inflation Factor (VIF). The VIF for the coefficient of xj is defined (Draper and Smith [5]) as:
$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2} \qquad (26)$$
where Rj is the multiple correlation coefficient of xj on all other regressors. If di represents a residual in this regression, the usual formula for $R_j^2$ is given by
$$R_j^2 = 1 - \frac{\sum_i d_i^2}{\sum_i (x_{ij} - \bar{x}_j)^2} \qquad (27)$$
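The computation in eqs (26) and (27) can be sketched as follows. The data are invented for illustration (x3 is nearly a copy of x2, so its VIF is large), and the helper name `vif` is ours, not a standard library routine:

```python
import numpy as np

# VIF_j = 1 / (1 - R_j^2), with R_j^2 obtained by regressing x_j on
# the other regressors, per eqs (26)-(27).  Illustrative data only.
rng = np.random.default_rng(0)
x2 = rng.normal(size=30)
x3 = x2 + 0.1 * rng.normal(size=30)   # nearly collinear with x2

def vif(target, others):
    # Regress `target` on an intercept plus the other regressors.
    Z = np.column_stack([np.ones_like(target)] + others)
    coef, *_ = np.linalg.lstsq(Z, target, rcond=None)
    d = target - Z @ coef                       # residuals d_i
    centered_ss = ((target - target.mean())**2).sum()
    r2 = 1 - (d @ d) / centered_ss              # eq (27)
    return 1 / (1 - r2)                         # eq (26)

print(vif(x3, [x2]))   # large, since x3 is nearly a copy of x2
```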
Now, Snee and Marquardt (Belsley [6], “comments”) make, implicitly, a distinction between the two “models”:
$$y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + e \qquad (28a)$$
with x1≡1, and
$$y = \alpha + \beta_2 (x_2 - \bar{x}_2) + \cdots + \beta_p (x_p - \bar{x}_p) + e \qquad (28b)$$
where (28b) is called the “centered” model. For (28b), Snee and Marquardt use eq. (27), but for (28a) they appear to use the definition:
$$R_j^2 = 1 - \frac{\sum_i d_i^2}{\sum_i x_{ij}^2} \qquad (29)$$
Equation 29, in which the denominator of the last term is not centered, is not explicitly given by Snee and Marquardt, but is implied by their statement:
“If the domain of prediction includes the full range from the natural origin through the range of the data, then collinearity diagnostics should not be mean-centered,” and confirmed by the VIF values given in their table 1. In this table, “no centering” results in VIF values of 200,000 and 400,000, while the VIFs for the “centered” data are unity. The quoted statement occurs in a section entitled “Model building must consider the intended or implied domain of prediction.” The basic idea underlying that section is that the analysis of the data, based on the “collinearity diagnostics” (specifically, the VIF values), is governed by the location of the points where one wishes to make predictions and, more specifically, by whether the origin (x1 = 1, x2 = x3 = ⋯ = 0) is such a point. The VIF values which, according to Snee and Marquardt’s formulas, depend heavily on whether or not this origin is included, will then indicate the quality of the predicted values.
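The discrepancy between the centered and non-centered formulas is easy to reproduce. The sketch below uses invented data (not Snee and Marquardt’s table 1): two regressors that are essentially orthogonal after centering, but located far from the origin, so that the uncentered denominator of eq (29) dwarfs the residual sum of squares:

```python
import numpy as np

# Two regressors, nearly orthogonal after centering, but far from the
# origin: eq (29) reports an enormous "VIF" while eq (27) reports a
# value near unity.  (Illustrative data only.)
rng = np.random.default_rng(1)
x2 = 1000 + rng.normal(size=20)
x3 = 2000 + rng.normal(size=20)

def r2(target, others, centered):
    Z = np.column_stack([np.ones_like(target)] + others)
    coef, *_ = np.linalg.lstsq(Z, target, rcond=None)
    d = target - Z @ coef
    denom = (((target - target.mean())**2).sum() if centered
             else (target**2).sum())            # eq (27) vs eq (29)
    return 1 - (d**2).sum() / denom

vif_centered   = 1 / (1 - r2(x3, [x2], True))
vif_uncentered = 1 / (1 - r2(x3, [x2], False))
print(vif_centered, vif_uncentered)
```

The huge uncentered value reflects only the distance of the data from the origin, where no measurement was made.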
A more reasonable approach, and one more consistent with the procedures commonly used by scientists, is to limit prediction to the vicinity of the points where one made the measurements, unless additional information is available that justifies extrapolation of the regression equation to more distant points of the sample space. The vicinity of the measured points is determined by the EPD which, in the case of collinearity, may be considerably smaller than the sample domain. In this view, it is the location of the design points, rather than that of the intended points of prediction, that determines predictability. The latter is measured, not by VIF values, but rather by the more concrete VF values, for any desired point of prediction.
The view advocated by Snee and Marquardt sometimes results in an enormous difference in the VIF values between the centered and non-centered forms. Equation (29) serves no useful purpose and is, in fact, unjustified and misleading. It is unjustified because it not only includes the origin (x1 = 1, xk = 0 for k > 1) in the correlation and VIF calculations, but, moreover, gives this point infinite weight in these calculations. Yet no measurement was made at that point. Equation (29) is also misleading because it leads to very large VIF values for some non-centered regressions, implying that severe “ill-conditioning” exists, even when the X matrix is, except for some trivial coding, completely orthogonal (cf. [6]).
The ill-conditioning exists only in terms of the large VIF value. It is an artifact arising from the desire to make the two forms of the regression equation into two distinct “models”.
The two forms, eqs (28a) and (28b), lead to identical estimates for the βj, including β1, and for their standard errors. They also lead to identical values and variances for an estimated (predicted) ŷ, at any point of the X space. There seems to be no valid reason for two distinct equations for the VIF. They only lead to the false impression that centering can reduce or even remove collinearity.
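That the two forms are one and the same model can be checked directly; a minimal sketch with invented data, fitting both forms by least squares and comparing slopes, fitted values, and a prediction at an arbitrary new point:

```python
import numpy as np

# Fit forms (28a) and (28b) to the same data: the slope estimates,
# the fitted values, and any prediction agree exactly.
rng = np.random.default_rng(2)
x2 = rng.uniform(0, 10, size=15)
x3 = rng.uniform(0, 10, size=15)
y = 3 + 2 * x2 - x3 + rng.normal(size=15)

Xa = np.column_stack([np.ones(15), x2, x3])                          # (28a)
Xb = np.column_stack([np.ones(15), x2 - x2.mean(), x3 - x3.mean()])  # (28b)

ba, *_ = np.linalg.lstsq(Xa, y, rcond=None)
bb, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# Prediction at the new point (x2, x3) = (5, 5) in each parametrization.
pred_a = np.array([1.0, 5.0, 5.0]) @ ba
pred_b = np.array([1.0, 5.0 - x2.mean(), 5.0 - x3.mean()]) @ bb
print(pred_a, pred_b)
```

Centering merely re-expresses the same fit in shifted coordinates; nothing about the data, or the collinearity in it, changes.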
Our viewpoint in this paper is that the usefulness of a regression equation lies in its ability to “predict” y for interesting combinations of the x’s. We also take the position that inferences from the data alone should be confined to x points that are in the general geometric vicinity of the cluster of design points. An inference for points that are well outside this domain (i.e., outside a suitably defined EPD) is, in the absence of additional information, only a tentative conclusion, and not a valid scientific inference. Such conclusions may, however, be very useful, provided their tentative character is recognized, and provided they are subsequently subjected to further experimental verification.
Daniel and Wood [7] discuss briefly the relation between the variance of ŷ and the location of the point at which the prediction is made. However, their discussion is in the context of selecting the best subset of regressors from among the entire set of regressors, a subject different from the one dealt with in this paper.
Another publication that deals explicitly with predictability is a paper by Willan and Watts [8]. These authors define a “Region of Effective Predictability” (REP) as that portion of the X space in which the variance of the predicted ŷ does not exceed twice the variance of ŷ predicted at the centroid of the X matrix. The volume of this region for the actual data, denoted REPA, is then compared with that of a similarly defined region, REPo, for a “fictitious orthogonal reference design” of “orthogonal data with the same N and the same rms values as the actual data.” The ratio of the volume of REPA to that of REPo is taken as “an overall measure of the loss of predictability volume due to collinearity.”
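Willan and Watts’ membership criterion itself is straightforward; a sketch of the test, assuming Var(ŷ) proportional to x′(X′X)⁻¹x and using an invented near-collinear design (the function name `in_rep` is ours):

```python
import numpy as np

# REP membership test: a point x belongs to the REP when
# x'(X'X)^{-1}x <= 2 * (value at the centroid).  Illustrative design
# with x3 nearly equal to x2, hence collinear.
X = np.column_stack([np.ones(6),
                     [1.0, 2, 3, 4, 5, 6],
                     [1.1, 2.1, 2.9, 4.2, 4.8, 6.1]])
XtX_inv = np.linalg.inv(X.T @ X)

centroid = X.mean(axis=0)
v0 = centroid @ XtX_inv @ centroid   # equals 1/N when x1 is the intercept

def in_rep(point):
    return point @ XtX_inv @ point <= 2 * v0

# The centroid is in the REP; a point far off the line x3 ~ x2 is not.
print(in_rep(centroid), in_rep(np.array([1.0, 6.0, 1.0])))
```

For a collinear design the REP collapses toward the direction of the data, which is the same geometry the EPD describes, but without reducing it to a single summary number.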
This concept, apart from its artificial character, suffers from other shortcomings. Like so many other treatments, it attempts to provide a measure of collinearity. But the practitioner who is confronted with a collinear X matrix does not need a measure of collinearity: he needs a way to use the data for the purpose for which they were obtained. Furthermore, this measure loses its meaning when expansion variables are present. For example, for the artificial quadratic set of table 12, Willan and Watts’ measure would indicate a high degree of collinearity which, while literally true, is totally misleading, since the collinearity in no way reduces the usefulness and predicting power of the regression equation, as long as the meaning of the expansion term is taken into account. But even in cases without expansion terms, the measure in question may be misleading. Thus, when applied to the protein calibration data of table 5, it may well lead the analyst to give up on these data as a hopelessly collinear set, whereas, as we have seen, there is nothing wrong with this set and it can indeed be used very effectively for the calibration of a method for protein determination based on reflectance measurements.
Finally, a few words about estimating the β-coefficients considered as rates of change of y with changes in the individual xj. As pointed out by Box [9], this is generally not a desirable use of regression equations. If, however, it is the major purpose of a particular experiment, then this experiment should be designed accordingly, which means, essentially, with an orthogonal X matrix. A collinear X matrix leads to the ability to estimate certain linear combinations of the β’s much better than the β’s themselves. The experimenter can calculate the VF values, not only for any point of X space, but also for any β or combination of β’s, and he can do this without making a single measurement, i.e., in the planning stages of the experiment. If the experimenter does not take advantage of this opportunity, he may be in for considerable disappointment, after having spent time, money, and effort on inadequate experimentation. We believe that the advocacy of remedial techniques, such as Ridge Regression, for collinear data is unwise. One of the most important tasks of a data analyst is to detect, and to call attention to, limitations in the use and interpretation of the data.
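The planning-stage calculation mentioned above requires only the X matrix: Var(c′b)/σ² = c′(X′X)⁻¹c for any combination c of the β’s. A sketch with an invented collinear design (x3 nearly equal to x2; the helper name `vf` is ours):

```python
import numpy as np

# Before any y is measured, compute Var(c'b)/sigma^2 = c'(X'X)^{-1}c
# for any combination c of the betas.  In a collinear design, some
# combinations are estimated far more precisely than the individual
# coefficients.  Illustrative design: x3 ~ x2.
X = np.column_stack([np.ones(5),
                     [1.0, 2, 3, 4, 5],
                     [1.1, 1.9, 3.2, 3.9, 5.1]])
XtX_inv = np.linalg.inv(X.T @ X)

def vf(c):
    c = np.asarray(c, dtype=float)
    return c @ XtX_inv @ c

# The individual slopes beta2 and beta3 are poorly determined,
# but their sum is estimated very well.
print(vf([0, 1, 0]), vf([0, 0, 1]), vf([0, 1, 1]))
```

Such a computation, done before the experiment, would warn the experimenter that this design cannot separate β2 from β3, while their sum is well determined.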
Biography
About the Author: John Mandel is a statistical consultant serving with NBS' National Measurement Laboratory.
Footnotes
Figures in brackets indicate literature references.
We assume that in the X-system, the regressor x1 is identically equal to unity, to allow for an independent term.
References
- [1]. Buck, J. B., Studies on the Firefly, Part I: The Effects of Light and Other Agents on Flashing in Photinus pyralis, with Special Reference to Periodicity and Diurnal Rhythm, Physiological Zoology, 10, 45–58 (1937).
- [2]. Mandel, J., Use of the Singular Value Decomposition in Regression Analysis, The American Statistician, 36, 15–24 (1982).
- [3]. Fearn, T., A Misuse of Ridge Regression in the Calibration of a Near Infrared Reflectance Instrument, Applied Statistics, 32, 73–79 (1983).
- [4]. Belsley, D. A., Kuh, E., and Welsch, R. E., Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, Wiley, NY (1980).
- [5]. Draper, N. R., and Smith, H., Applied Regression Analysis, 2nd Edition, Wiley, NY (1981).
- [6]. Belsley, D. A., Demeaning Conditioning Diagnostics Through Centering, The American Statistician, 38, 73–77, and “Comments,” 78–93 (1984).
- [7]. Daniel, C., and Wood, F. S., Fitting Equations to Data, 2nd Edition, Wiley, NY (1980).
- [8]. Willan, A. R., and Watts, D. G., Meaningful Multicollinearity Measures, Technometrics, 20, 407–412 (1978).
- [9]. Box, G. E. P., Use and Abuse of Regression, Technometrics, 8, 625–629 (1966).